DOCUMENT RESUME 



ED 393 094 



CS 012 419 



AUTHOR        Mazzeo, John; And Others
TITLE         Technical Report of the NAEP 1994 Trial State Assessment Program in Reading.
INSTITUTION   Educational Testing Service, Princeton, N.J.; National Assessment of Educational Progress, Princeton, NJ.
SPONS AGENCY  National Center for Education Statistics (ED), Washington, DC.
REPORT NO     NCES-96-116
PUB DATE      Dec 95
NOTE          545p.; For related documents, see ED 388 962-963.
PUB TYPE      Collected Works - General (020) -- Reports - Research/Technical (143)



EDRS PRICE    MF02/PC22 Plus Postage.
DESCRIPTORS   *Data Analysis; Grade 4; Intermediate Grades; *Program Design; Program Implementation; *Reading Achievement; *Reading Research; Test Items
IDENTIFIERS   *National Assessment of Educational Progress; *Trial State Assessment (NAEP)



ABSTRACT 

This report provides a description of the design for 
the National Assessment of Educational Progress (NAEP) 1994 Trial 
State Assessment in Reading and gives an overview of the steps 
involved in the implementation of the program from the planning 
stages through to the analysis and reporting of data. The report does 
not provide the results of the assessment — rather, it provides 
information on how those results were derived. Chapters in the report 
are (1) "Overview: The Design, Implementation, and Analysis of the 
1994 Trial State Assessment Program in Reading" (John Mazzeo and 
Nancy L. Allen); (2) "Developing the Objectives, Cognitive Items, 
Background Questions, and Assessment Instruments" (Jay R. Campbell 
and Patricia L. Donahue); (3) "Sample Design and Selection" (James L. 
Green and others); (4) "State and School Cooperation and Field 
Administration" (Nancy Caldwell and Mark Waksberg); (5) "Processing 
and Scoring Assessment Materials" (Patrick Bourgeacq and others); (6) 
"Creation of the Database, Quality Control of Data Entry, and 
Creation of the Database Products" (John J. Ferris and others); (7) 
"Weighting Procedures and Variance Estimation" (Mansour Fahimi and 
others); (8) "Theoretical Background and Philosophy of NAEP Scaling 
Procedures" (Eugene G. Johnson and others); (9) "Data Analysis and 
Scaling for the 1994 Trial State Assessment in Reading" (Nancy L. 
Allen and others); and (10) "Conventions Used in Reporting the 
Results of the 1994 Trial State Assessment in Reading" (John Mazzeo 
and Clyde M. Reese). Contains 48 tables and 25 figures of data. 
Appendixes provide a list of participants in the objectives and item 
development process, a summary of participation rates, conditioning 
variables and contrast codings, IRT parameters for reading items, 
information on Trial State Assessment reporting subgroups, and 
discussions of setting achievement levels, the effect of monitoring 
on assessment sessions in nonpublic schools, correction of the NAEP 
program documentation error, and the information weighting error. 

(RS) 



What is The Nation’s Report Card? 

THE NATION’S REPORT CARD, the National Assessment of Educational Progress (NAEP), is the only nationally representative and continuing 
assessment of what America’s students know and can do in various subject areas. Since 1969, assessments have been conducted periodically in 
reading, mathematics, science, writing, history/geography, and other fields. By making objective information on student performance available to 
policymakers at the national, state, and local levels, NAEP is an integral part of our nation’s evaluation of the condition and progress of education. 
Only information related to academic achievement is collected under this program. NAEP guarantees the privacy of individual students and their 
families. 

NAEP is a congressionally mandated project of the National Center for Education Statistics, the U.S. Department of Education. The 
Commissioner of Education Statistics is responsible, by law, for carrying out the NAEP project through competitive awards to qualified organizations. 
NAEP reports directly to the Commissioner, who is also responsible for providing continuing reviews, including validation studies and 
solicitation of public comment, on NAEP’s conduct and usefulness. 

In 1988, Congress established the National Assessment Governing Board (NAGB) to formulate policy guidelines for NAEP. The Board is 
responsible for selecting the subject areas to be assessed from among those included in the National Education Goals; for setting appropriate student 
performance levels; for developing assessment objectives and test specifications through a national consensus approach; for designing the 
assessment methodology; for developing guidelines for reporting and disseminating NAEP results; for developing standards and procedures for 
interstate, regional, and national comparisons; for determining the appropriateness of test items and ensuring they are free from bias; and for taking 
actions to improve the form and use of the National Assessment. 



The National Assessment Governing Board 



Honorable William T. Randall, Chair 
Commissioner of Education 
State Department of Education 
Denver, Colorado 

Mary Blanton 
Attorney 

Salisbury, North Carolina 

Honorable Evan Bayh 
Governor of Indiana 
Indianapolis, Indiana 

Patsy Cavazos 
Principal 

W.G. Love Elementary School 
Houston, Texas 

Honorable Naomi K. Cohen 
Former Representative 
State of Connecticut 
Hartford, Connecticut 

Charlotte A. Crabtree 
Professor of Education 
University of California 
Los Angeles, California 

Catherine L. Davidson 
Secondary Education Director 
Central Kitsap School District 
Silverdale, Washington 

James E. Ellingson 
Fourth-grade Teacher 
Probstfield Elementary School 
Moorhead, Minnesota 

Chester E. Finn, Jr. 

John M. Olin Fellow 
Hudson Institute 
Washington, DC 



Michael J. Guerra 
Executive Director 
Secondary School Department 
National Catholic Education Association 
Washington, DC 

William (Jerry) Hume 
Chairman 

Basic American, Inc. 

San Francisco, California 

Jan B. Loveless 
Educational Consultant 
Jan B. Loveless & Associates 
Midland, Michigan 

Marilyn McConachie 
Local School Board Member 
Glenbrook High Schools 
Glenview, Illinois 

Honorable Stephen E. Merrill 
Governor of New Hampshire 
Concord, New Hampshire 

Jason Millman 

Prof, of Educational Research Methodology 
Cornell University 
Ithaca, New York 

Honorable Richard P. Mills 
Commissioner of Education 
New York State Department of Education 
Albany, New York 

William J. Moloney 
Superintendent of Schools 
Calvert County Public Schools 
Prince Frederick, Maryland 



Mark D. Musick 
President 

Southern Regional Education Board 
Atlanta, Georgia 

Mitsugi Nakashima 

Hawaii State Board of Education 

Honolulu, Hawaii 

Michael T. Nettles 

Professor of Education & Public Policy 
University of Michigan 
Ann Arbor, Michigan 

Honorable Edgar D. Ross 
Attorney 

Christiansted, St. Croix 
U.S. Virgin Islands 

Fannie N. Simmons 
Mathematics Specialist 
Midlands Improving Math & Science Hub 
Columbia, South Carolina 

Marilyn A. Whirry 
Twelfth-grade English Teacher 
Mira Costa High School 
Manhattan Beach, California 

Sharon P. Robinson (ex-officio) 

Assistant Secretary 
Office of Educational Research 
and Improvement 
U.S. Department of Education 
Washington, DC 



Roy Truby 

Executive Director, NAGB 
Washington, DC 






NATIONAL CENTER FOR EDUCATION STATISTICS 

Technical Report of the 
NAEP 1994 Trial State Assessment 
Program in Reading 



John Mazzeo 
Nancy L. Allen 
Debra L. Kline 

in collaboration with 



Patrick B. Bourgeacq 
Mary Lyn Bourque 
Charles L. Brungardt 
John Burke 
Nancy Caldwell 
Jay R. Campbell 
Patricia L. Donahue 
Mansour Fahimi 
John J. Ferris 
David S. Freund 
James L. Green 
Eddie H.S. Ip 
Steven P. Isham 
Eugene G. Johnson 



Tillie Kennel 
Robert J. Mislevy 
Clyde M. Reese 
Linda L. Reynolds 
Timothy Robinson 
Alfred M. Rogers 
Keith F. Rust 
Patricia M. Garcia Stearns 
Brent W. Studer 
Spencer Swinton 
Bradley J. Thayer 
Neal Thomas 
Mark M. Waksberg 
Lois H. Worthington 



with a Foreword by Gary W. Phillips 
December 1995 



THE NATION'S 
REPORT 
CARD 









Prepared by Educational Testing Service under contract 
with the National Center for Education Statistics 

Office of Educational Research and Improvement 
U.S. Department of Education 







U.S. Department of Education 

Richard W. Riley 
Secretary 

Office of Educational Research and Improvement 

Sharon P. Robinson 
Assistant Secretary 

National Center for Education Statistics 

Jeanne E. Griffith 
Acting Commissioner 

Education Assessment Division 
Gary W. Phillips 
Associate Commissioner 



December 1995 



FOR MORE INFORMATION: 

For ordering information on this report, write: 

National Library of Education 

Office of Educational Research and Improvement 

U.S. Department of Education 

555 New Jersey Avenue, NW 

Washington, D.C. 20208-5641 

or call 1-800-424-1616 (in the Washington, D.C. metropolitan area 
call 202-219-1651). 

The work upon which this publication is based was performed for the 
National Center for Education Statistics, Office of Educational Research 
and Improvement, by Educational Testing Service. 

Educational Testing Service is an equal opportunity, affirmative action employer. 

Educational Testing Service, ETS, and the ETS logo are registered trademarks of Educational Testing Service. 



TECHNICAL REPORT 

OF THE NAEP 1994 TRIAL STATE ASSESSMENT PROGRAM 

IN READING 



TABLE OF CONTENTS 



List of Tables and Figures ix 

Acknowledgments xiii 

Foreword xvii 

Gary W. Phillips 

Chapter 1 Overview: The Design, Implementation, and Analysis of the 1994 

Trial State Assessment Program in Reading 1 

John Mazzeo and Nancy L. Allen 

1.1 Overview 1 

1.2 Design of the Trial State Assessment in Reading 6 

1.3 Development of Reading Objectives, Items, and Background Questions 7 

1.4 Assessment Instruments 9 

1.5 The Sampling Design 10 

1.6 Field Administration 10 

1.7 Materials Processing and Database Creation 11 

1.8 The Trial State Assessment Data 12 

1.9 Weighting and Variance Estimation 13 

1.10 Preliminary Data Analysis 13 

1.11 Scaling the Assessment Items 14 

1.12 Linking the Trial State Results to the National Results 15 

1.13 Reporting the Trial State Assessment Results 16 

Chapter 2 Developing the Objectives, Cognitive Items, Background Questions, and 

Assessment Instruments 19 

Jay R. Campbell and Patricia L. Donahue 

2.1 Overview 19 

2.2 Framework and Assessment Design Principles 20 

2.3 Framework Development Process 21 

2.4 Framework for the Assessment 22 

2.5 Distribution of Assessment Items 22 

2.6 Developing the Cognitive Items 27 

2.7 Student Assessment Booklets 29 

2.8 Questionnaires 30 

2.8.1 Student Questionnaires 32 

2.8.2 Teacher, School, and IEP/LEP Student Questionnaires 33 

2.9 Development of Final Forms 33 






Chapter 3 Sample Design and Selection 35 

James L. Green, John Burke, and Keith F. Rust 

3.1 Overview 35 

3.2 Sample Selection for the 1993 Field Test 36 

3.2.1 Primary Sampling Units 37 

3.2.2 Selection of Schools and Students 37 

3.2.3 Assignment to Sessions for Different Subjects 38 

3.3 Sampling Frame for the 1994 Assessment 38 

3.3.1 Choice of School Sampling Frame 38 

3.3.2 Missing Stratification Data 39 

3.3.3 In-scope Schools 41 

3.4 Stratification Within Jurisdictions 41 

3.4.1 Stratification Variables 41 

3.4.2 Urbanization Classification 41 

3.4.3 Minority Classification 42 

3.4.4 Metro Status 53 

3.4.5 School Type 53 

3.4.6 Median Household Income 53 

3.5 School Sample Selection for the 1994 Trial State Assessment 53 

3.5.1 Control of Overlap of School Samples for National 

Educational Studies 53 

3.5.2 Selection of Schools in Small Jurisdictions 60 

3.5.3 New School Selection 60 

3.5.4 Designating Monitor Status 61 

3.5.5 School Substitution and Participation 61 

3.6 Student Sample Selection 63 

Chapter 4 State and School Cooperation and Field Administration 71 

Nancy Caldwell and Mark Waksberg 

4.1 Overview 71 

4.2 The Field Test 71 

4.2.1 Conduct of the Field Test 71 

4.2.2 Results of the Field Test 72 

4.3 The 1994 Trial State Assessment 72 

4.3.1 Overview of Responsibilities 74 

4.3.2 Schedule of Data Collection Activities 75 

4.3.3 Preparations for the Trial State Assessment 76 

4.3.4 Monitoring of Assessment Activities 79 

4.3.5 School and Student Participation 79 

4.3.6 Results of the Observations 81 

Chapter 5 Processing and Scoring Assessment Materials 83 

Patrick Bourgeacq, Charles Brungardt, Pat Garcia, 

Tillie Kennel, Linda Reynolds, Tim Robinson, 

Brent Studer, and Brad Thayer 







5.1 Overview 83 

5.1.1 Innovations for 1994 83 

5.2 Printing 86 

5.2.1 Overview 86 

5.2.2 Trial State Assessment Printing 86 

5.3 Packaging and Shipping 88 

5.3.1 Distribution 88 

5.3.2 Short Shipment and Phones 91 

5.4 Processing 92 

5.4.1 Overview 92 

5.4.2 Document Receipt and Tracking 92 

5.4.3 Data Entry 95 

5.4.4 Data Validation 98 

5.4.5 Editing for Non-image and Key-entered Documents 100 

5.4.6 Data Validation and Editing of Image Processed Documents 101 

5.4.7 Data Transmission 102 

5.5 Professional Scoring 102 

5.5.1 Overview 102 

5.5.2 Training Paper Selection 103 

5.5.3 General Training Guidelines 104 

5.5.4 Table Leader Utilities and Reliability Reports 105 

5.5.5 Main and Trial State Reading Assessment 106 

5.5.6 Training for the Main and State Reading Assessment 107 

5.5.7 Scoring the Main and State Reading Assessment 107 

5.5.8 1992 Short-term Trend and Image/Paper Special Study 108 

5.5.9 Calibration Bridges 108 

5.5.10 The Performance Scoring Center 110 

5.6 Data Delivery 110 

5.7 Miscellaneous 112 

5.7.1 Storage of Documents 112 

5.7.2 Quality Control Documents 112 

5.7.3 Alert Analysis 113 



Chapter 6 Creation of the Database, Quality Control of Data Entry, and 

Creation of the Database Products 115 

John J. Ferris, David S. Freund, and Alfred M. Rogers 

6.1 Overview 115 

6.2 Creation of the Database 115 

6.2.1 Merging Files into the Trial State Assessment Database 115 

6.2.2 Creating the Master Catalog 116 

6.3 Quality Control Evaluation 117 

6.3.1 Student Data 118 

6.3.2 Teacher Questionnaires 121 

6.3.3 School Questionnaires 121 

6.3.4 IEP/LEP Student Questionnaires 121 

6.4 NAEP Database Products 121 

6.4.1 The Item Information Database 122 

6.4.2 The Secondary-use Data Files 122 









Chapter 7 Weighting Procedures and Variance Estimation 129 

Mansour Fahimi, Keith F. Rust, and John Burke 

7.1 Overview 129 

7.2 Calculation of Base Weights 130 

7.2.1 Calculation of School Base Weights 130 

7.2.2 Weighting New Schools 131 

7.2.3 Treatment of New and Substitute Schools 131 

7.2.4 Calculation of Student Base Weights 131 

7.3 Adjustments for Nonresponse 132 

7.3.1 Defining Initial School-level Nonresponse Adjustment Classes 132 

7.3.2 Constructing the Final Nonresponse Adjustment Classes 133 

7.3.3 School Nonresponse Adjustment Factors 134 

7.3.4 Student Nonresponse Adjustment Classes 135 

7.3.5 Student Nonresponse Adjustments 135 

7.4 Characteristics of Nonresponding Schools and Students 136 

7.4.1 Weighted Distributions of Schools Before and 

After School Nonresponse 138 

7.4.2 Characteristics of Schools Related to Response 141 

7.4.3 Weighted Distributions of Students Before and After Student 

Absenteeism 144 

7.5 Variation in Weights 147 

7.6 Calculation of Replicate Weights 149 

7.6.1 Defining Replicate Groups and Forming Replicates 

for Variance Estimation 150 

7.6.2 School-level Replicate Weights 151 

7.6.3 Student-level Replicate Weights 153 

Chapter 8 Theoretical Background and Philosophy of NAEP 

Scaling Procedures 155 

Eugene G. Johnson, Robert J. Mislevy, and Neal Thomas 

8.1 Overview 155 

8.2 Background 155 

8.3 Scaling Methodology 157 

8.3.1 The Scaling Models 157 

8.3.2 An Overview of Plausible Values Methodology 161 

8.3.3 Computing Plausible Values in IRT-based Scales 163 

8.4 NAGB Achievement Levels 164 

8.5 Analyses 165 

8.5.1 Computational Procedures 165 

8.5.2 Statistical Tests 166 

8.5.3 Biases in Secondary Analyses 167 

Chapter 9 Data Analysis and Scaling for the 1994 Trial State Assessment in Reading 169 

Nancy L. Allen, John Mazzeo, Eddie H. S. Ip, Spencer Swinton, 

Steven P. Isham, and Lois Worthington 

9.1 Overview 169 






9.2 Assessment Instruments and Scoring 170 

9.2.1 Items, Booklets, and Administration 170 

9.2.2 Scoring the Constructed-response Items 170 

9.2.3 Instrument Validity Evidence 172 

9.3 Item Analyses 172 

9.3.1 Conventional Item and Test Analyses 172 

9.3.2 Differential Item Functioning (DIF) Analyses 185 

9.4 Item Response Theory (IRT) Scaling 190 

9.4.1 Item Parameter Estimation 195 

9.5 Estimation of State and Subgroup Proficiency Distributions 203 

9.6 Linking State and National Scales 210 

9.7 Producing a Reading Composite Scale 218 



Chapter 10 Conventions Used in Reporting the Results of the 1994 Trial State Assessment 

in Reading 

John Mazzeo and Clyde M. Reese 



10.1 Overview 221 

10.2 Minimum School and Student Sample Sizes for 

Reporting Subgroup Results 224 

10.3 Estimates of Standard Errors with Large Mean Squared Errors 225 

10.4 Treatment of Missing Questionnaire Data 226 

10.5 Statistical Rules Used for Producing the State Reports 228 

10.5.1 Comparing Means and Proportions for Mutually Exclusive 

Groups of Students 228 

10.5.2 Multiple Comparison Procedures 230 

10.5.3 Determining the Highest and Lowest Scoring Groups from 

a Set of Ranked Groups 230 

10.5.4 Comparing 1994 and 1992 Results in State Report Tables 231 

10.5.5 Comparing Dependent Proportions 231 

10.5.6 Statistical Significance and Estimated Effect Sizes 232 

10.5.7 Descriptions of the Magnitude of Percentage 233 

10.6 Comparisons of 1994 and 1992 Results in the First Look 

Report, the Reading Report Card, and the Cross-State Data Compendium 233 



Appendix A Participants in the Objectives and Item Development Process 243 

Appendix B Summary of Participation Rates 251 

Appendix C Conditioning Variables and Contrast Codings 275 

Appendix D IRT Parameters for Reading Items 319 

Appendix E Trial State Assessment Reporting Subgroups; 

Composite and Derived Common Background Variables; 

Composite and Derived Reporting Variables 325 

Appendix F Setting the NAEP Achievement Levels for the 1994 

Reading Assessment 

Mary Lyn Bourque 

Appendix G The Effect of Monitoring on Assessment Sessions in Nonpublic Schools 

Eddie H. S. Ip and Nancy L. Allen 

Appendix H Correction of the NAEP Program Documentation Error 

Appendix I The Information Weighting Error 

Susan C. Loomis, Luz Bay, and Wen-Hung Chen 

References Cited in Text 



LIST OF TABLES AND FIGURES 



Table 1-1 Jurisdictions participating in the 1994 Trial State Assessment Program 

Figure 2-1 Description of reading stances 23 

2-2 Description of purposes for reading 24 

2-3 1992 and 1994 framework: Aspects of reading literacy 25 

Table 2-1 Weighting of the reading purpose scales on the composite reading scale 26 

2-2 Percentage distribution of items by reading stance 26 

2-3 Percentage of assessment time devoted to the reading stances 29 

2-4 Cognitive and noncognitive block information 31 

2-5 Booklet contents 31 


Table 3-1 Number of sessions by grade and session type 38 

3-2 Distribution of fourth-grade schools and enrollment as reported in QED 1992 40 

3-3 Distribution of the selected public schools by sampling strata 43 

3-4 Distribution of the selected nonpublic schools by sampling strata 54 

3-5 Distribution of new schools coming from large and small districts 62 

3-6 Substitute school counts for fourth-grade schools 64 

3-7 Distribution of the fourth-grade public-school sample by jurisdiction 65 

3-8 Distribution of the fourth-grade nonpublic-school sample by jurisdiction 66 

3-9 Distribution of the fourth-grade public-school student sample and response rates by 

jurisdiction 68 

3-10 Distribution of the fourth-grade nonpublic-school student sample and response rates 

by jurisdiction 69 



Figure 4-1 Map of participating jurisdictions, 1992 and 1994 assessments 
Table 4-1 School participation, 1994 Trial State Assessment 
4-2 Student participation, 1994 Trial State Assessment 



Table 5-1 Trial State Assessment processing totals 84 

5-2 Trial State Assessment NCS schedule 85 

5-3 Documents printed for the Trial State Assessment 87 

Figure 5-1 Trial State Assessment materials distribution flow 89 

Table 5-4 Trial State Assessment phone request summary 91 

Figure 5-2 Trial State Assessment materials processing flow chart 93 

5-3 Trial State Assessment image scanning flow chart 97 

Table 5-5 Number of constructed-response items in each range of percentages of exact 

agreement between readers 106 

5-6 Alerts for 1994 national and Trial State assessments 113 



Table 6-1 Number of reading booklets scanned and selected for quality control evaluation 119 

6-2 Inference from the quality control evaluation of grade 4 data 120 



Table 7-1 Unweighted and final weighted counts of assessed and excluded students by jurisdiction 137 

7-2 Weighted mean values derived from sampled public schools 139 

7-3 Weighted mean values derived from all sampled schools for jurisdictions achieving 

minimal required public- and nonpublic-school participation, before substitution 140 

7-4 Results of logistic regression analyses of school nonresponse 142 







7-5 Weighted student percentages derived from sampled public schools 145 

7-6 Weighted student percentages derived from all schools sampled 148 

Table 9-1 1994 NAEP reading block composition by scale and item type for grade 4 171 

9-2 Score range, percent agreement, and Cohen's Kappa for the short constructed- 

response reading items used in scaling 173 

9-3 Score range, percent agreement, and intraclass correlation for the extended 

constructed- response items used in scaling 174 

9-4 Descriptive statistics for each block of items by position within test booklet and 

overall, public school 175 

9-5 Descriptive statistics for each block of items by position within test booklet and 

overall, nonpublic school 176 

9-6 Block-level descriptive statistics for unmonitored and monitored public-school sessions 179 

9-7 Block-level descriptive statistics for unmonitored and monitored nonpublic-school sessions 179 

Figure 9-1 Stem-and-leaf display of state-by-state differences in average item scores by scale in 

public schools 181 

9-2 Stem-and-leaf display of state-by-state differences in average item scores by scale in 

nonpublic schools 183 

Table 9-8 Frequency distributions of DIF statistics for grade 4 dichotomous items grouped by 

content area 188 

9-9 Frequency distributions of DIF statistics for grade 4 polytomous items grouped by 

content area 189 

Figure 9-3 Stem-and-leaf display of average item scores for public-school sessions 191 

9-4 Stem-and-leaf display of average item scores for nonpublic-school sessions 193 

Table 9-10 Extended constructed-response items 196 

Figure 9-5 Plots comparing empirical and model-based estimates of item response functions 

for binary-scored items exhibiting good model fit 199 

9-6 Plot comparing empirical and model-based estimates of item category characteristic 

curves for a polytomous item exhibiting good model fit 200 

9-7 Plot comparing empirical and model-based estimates of item response functions for 

binary-scored items exhibiting some model misfit 201 

9-8 Plot comparing empirical and model-based estimates of item category characteristic 

curves for a polytomous item exhibiting some model misfit 202 

9-9 Plot comparing empirical and model-based estimates of the item response function 
for item R012111 using 1992 assessment data before collapsing unsatisfactory and 
partial response categories 204 

9-10 Plot comparing empirical and model-based estimates of the item response function 
for item R012111 using 1992 assessment data after collapsing unsatisfactory and 
partial response categories 205 

9-11 Plot comparing empirical and model-based estimates of the item response function 
for item R012111 using 1994 assessment data after collapsing unsatisfactory and 
partial response categories 206 

9-12 Plot comparing empirical and model-based estimates of the item response function 
for item R015707 using 1994 assessment data before collapsing unsatisfactory and 
partial response categories 207 

9-13 Plot comparing empirical and model-based estimates of the item response function 
for item R015707 using 1994 assessment data after collapsing unsatisfactory and 
partial response categories 208 

Table 9-11 Summary statistics for Trial State Assessment conditioning models 211 

Figure 9-14 Plot of mean proficiency versus mean item score by jurisdiction 

Table 9-12 Transformation constants 214 




Figure 9-15 Rootogram comparing proficiency distributions for the Trial State Assessment 
aggregate sample and the state aggregate comparison sample from the national 
assessment for the reading for literary experience scale 216 

9-16 Rootogram comparing proficiency distributions for the Trial State Assessment 
aggregate sample and the state aggregate comparison sample from the national 
assessment for the reading to gain information scale 217 

Table 9-13 Weights used for each scale to form the reading composite 218 

Figure 9-17 Rootogram comparing proficiency distributions for the Trial State Assessment 
aggregate sample and the state aggregate comparison sample from the national 
assessment for the reading composite scale 219 



Table 10-1 Weighted percentage of fourth-grade students matched to teacher questionnaire 227 

Figure 10-1 Rules for descriptive terms in state reports 233 

Table 10-2 Average overall reading proficiency and percentage of students at achievement levels 236 

10-3 Percentage of students and average overall reading proficiency by race/ethnicity 239 






ACKNOWLEDGMENTS 



The design, development, analysis, and reporting of the Trial State 
Assessment Program was truly a collaborative effort among staff from State 
Education Agencies, the National Center for Education Statistics (NCES), 
Educational Testing Service (ETS), Westat, and National Computer Systems (NCS). 
The program benefited from the contributions of hundreds of individuals at the 
state and local levels— Governors, Chief State School Officers, State and District 
Test Directors, State Coordinators, and district administrators— who tirelessly 
provided their wisdom, experience, and hard work. Finally, and most importantly, 
NAEP is grateful to the students and school staff who participated in the Trial State 
Assessment. 

This report documents the design and data analysis procedures behind the 
1994 Trial State Assessment in reading. It also provides insight into the rationale 
behind the technical decisions made about the program. The development of this 
Technical Report, and especially of the Trial State Assessment Program, is the 
culmination of effort by many individuals who contributed their considerable 
knowledge, experience, and creativity to the 1994 Trial State Assessment Program 
in reading. 

The 1994 Trial State Assessment was funded through the National Center 
for Education Statistics in the Office of Educational Research and Improvement of 
the U.S. Department of Education. Emerson Elliott, past NCES Commissioner, 
provided consistent support and guidance. NCES staff (particularly Jeanne 
Griffith, Gary Phillips, Steve Gorman, Peggy Carr, Sharif Shakrani, Susan Ahmed, 
Larry Ogle, Shi-Chang Wu, Maureen Treacy, and Sheida White) worked closely 
and collegially with ETS, Westat, and NCS staff and played a crucial role in all 
aspects of the program. The technical report was reviewed by Jamal Abedi of 
UCLA and Ralph Lee, Marilyn Binkley, Sue Ahmed, and Shi-Chang Wu of NCES. 

The members of the National Assessment Governing Board (NAGB) and 
NAGB staff provided advice and guidance throughout, and their contractor, 
American College Testing, worked with various panels in setting the achievement 
levels, and carried out a variety of analyses related to the levels. NAGB's contractor 
for the reading consensus project, the Council of Chief State School Officers, worked 
diligently under tight time constraints to create the forward-thinking framework 
underlying the assessment. 

NAEP owes a great deal to the numerous panelists and consultants who 
worked so diligently on developing the assessment and providing a frame for 
interpreting the results, including those who helped create the objectives, develop the 
assessment instruments, set the achievement levels, and provide the anchoring 
descriptions. 



Under the NAEP contract to ETS, Archie Lapointe served as executive 
director and Paul Williams as project director. Steve Lazer managed test 
development activities, and Jay Campbell worked with the Reading Item 
Development committee to develop the assessment instruments. Jules Goodison 
managed the operational aspects together with John Olson. Eugene Johnson led the 
measurement and research efforts; John Barone directed data analysis activities. 
ETS management has been very supportive of NAEP’s technical work. Special 
thanks go to Nancy Cole and Ernie Anastasio as well as to Henry Braun and Charles 
Davis of ETS research management. 

The guidance of the NAEP Design and Analysis Committee on the technical 
aspects of NAEP has been outstanding. The members are Sylvia Johnson (chair), 
Albert Beaton, Jeri Benson, William Cooley, Jeremy Finn, Huynh Huynh, Gaea 
Leinhardt, David Lohman, Bengt Muthén, Anthony Nitko, Ingram Olkin, Tej 
Pandey, and Juliet Shaffer. We were saddened by the untimely death of Clifford 
Clogg, whose service to NAEP will not be forgotten. 

Statistical and psychometric activities were led by Nancy Allen, Spencer 
Swinton, and Eddie Ip under the direction of Eugene Johnson, Jim Carlson, and 
John Mazzeo. Major contributions were also made by Huahua Chang, John 
Donoghue, Frank Jenkins, Jo-lin Liang, Eiji Muraki, and Neal Thomas. Robert 
Mislevy provided valuable statistical and psychometric advice. 

Under the leadership of John Barone, the division of Data Analysis and 
Technology Research developed the operating systems and carried out the data 
analyses. David Freund and Alfred Rogers developed and maintained the large and 
complex NAEP data management systems; Kate Pashley managed database activities 
and Patricia O’Reilly was responsible for the restricted-use version of the data. 
Alfred Rogers developed the production versions of key analysis and scaling systems. 
Special thanks go to Steven Isham, who performed the reading data analyses, 
assisted by Lois Worthington. Laura Jerry led the computer-based development and 
production of the state reading reports and Jennifer Nelson produced the data 
compendium. They were assisted by Phillip Leung, Inge Novatkoski, Steven Isham, 
and David Freund. Alfred Rogers developed the report maps. Other members of 
this division who made important contributions to NAEP data analyses were Laura 
Jenkins, Michael Narcowich, Craig Pizzuti, and Minhwei Wang. 

The staff of Westat, Inc. contributed their exceptional talents in all areas of 
sample design and data collection. Field administration and data collection were 
carried out under the direction of Renee Slobasky and Nancy Caldwell. Keith Rust 
developed and supervised the sampling design. Mark Waksberg, Leslie Wallace, 
Debra Vivari, Dianne Walsh, Leyla Mohadjer, Adam Chu, Valerija Smith, and 
Jacqueline Severynse undertook major roles in these activities. 

Critical to the program was the contribution of National Computer Systems, 
Inc. Printing, distribution, scoring, and processing of the assessment materials were 
carried out under the leadership of Judy Moyer and Brad Thayer, with additional 
contributions by Patrick Bourgeacq, Charles Brungardt, Patricia Garcia Stearns, 
Tillie Kennel, Linda L. Reynolds, Timothy Robinson, and Brent Studer. 

Jay Campbell, John Mazzeo, and Clyde Reese collaborated on the text for 
the assessment reports. Karen Miller coordinated the quality control and checking 
of the reports, much of which was carried out by Alice Kass, Yvette Hillard, and 
Christy Schwager. Thanks go also to the many individuals who reviewed the reports, 
especially the editors who improved the text and the data analysts who checked the 
accuracy of the data. 

Mary Michaels and Sharon David-Johnson coordinated the cover design and 
final production of this technical report. Debra Kline was responsible for organizing, 
scheduling, editing, and ensuring the accuracy of the final report. 








FOREWORD 



This technical report summarizes some of the most complex statistical 
methodology used in any survey or testing program in the United States. In its 25- 
year history, the National Assessment of Educational Progress (NAEP) has 
pioneered such state-of-the-art techniques as matrix sampling and item response 
theory models. Today it is the leading survey using the advanced plausible values 
methodology, which uses a multiple imputation procedure in a psychometric context. 

The 1994 Trial State Assessment in reading followed the same basic design 
as that used for the 1990 and 1992 Trial State Assessments in mathematics and 
reading. Properties of the 1994 reading assessment common to the 1990 and 1992 
assessments include: 1) continuing the use of focused-BIB spiraling, item response 
theory models, and plausible values; 2) keeping the national and Trial State 
Assessment samples and scales separate; 3) doing separate stratifications and 
conditioning in each of the state samples; 4) making each state sample have power 
similar to the regional samples from the national assessment (this is how the sample 
sizes for the states were determined); 5) equating state and national scales using the 
aggregate of the state samples and a national subsample that was representative of 
the aggregate of the states; and 6) using power rules and other statistical 
considerations to determine which subgroup comparisons were supported by 
sufficient school and student sample sizes. One new activity in the 1994 assessment 
was the inclusion of nonpublic schools at the state level. The goal was to make the 
state estimates more representative of the total student population and, where 
possible, provide state estimates for the nonpublic school subgroups. 

The 1994 Trial State Assessment provided many opportunities to test the 
limits of statistical theory and thereby advance the state of the art. Some examples 
include: 1) conditioning on a smaller set of principal components rather than a 
larger set of background variables and 2) the use of the two-parameter polytomous 
item response theory model for scaling constructed-response and extended 
constructed-response items. It is expected that in the future the conditioning models 
may be expanded in ways that will help secondary analysts who want to use 
hierarchical linear models as part of their statistical analysis procedures. 

The Trial State Assessment has many statistical challenges ahead that must 
be dealt with. As the NAEP project plans for the 1996 assessment, it must find ways 
to: 1) accurately report results for nonpublic schools (which have less well developed 
sampling frames); 2) provide accommodations and adaptations for students with 
disabilities and limited English proficiency; and, 3) provide reports to the States 
within a six-month period. The project can and will meet these challenges. 

The NAEP project is not only characterized by elegant statistical procedures, 
but it is also noted for the dedicated professionalism of its staff. It is the stubborn 
insistence that surveys are scientific activities and relentless quest for improved 
methodology that have made NAEP credible for over two decades. 



Gary W. Phillips 
Associate Commissioner 
National Center for Education 
Statistics 






Chapter 1 
OVERVIEW: 



THE DESIGN, IMPLEMENTATION, AND ANALYSIS OF THE 
1994 TRIAL STATE ASSESSMENT PROGRAM IN READING 



John Mazzeo and Nancy L. Allen 
Educational Testing Service 



The National Assessment shall conduct a 1994. . .trial reading assessment for the 4th 
grade, in states that wish to participate, with the purpose of determining whether such 
assessments yield valid and reliable State representative data. (Section 406(i)(2)(C)(i) 
of the General Education Provisions Act, as amended by Pub. L. 103-33 (U.S.C. 1221e-1(a)(2)(B)(iii))) 

The National Assessment shall include in each sample assessment. . .students in public 
and private schools in a manner that ensures comparability with the national sample. 
(Section 406(i)(2)(C)(i) of the General Education Provisions Act, as amended by Pub. 
L. 103-33 (U.S.C. 1221e-1(a)(2)(B)(iii))) 



1.1 OVERVIEW 

In April 1988, Congress reauthorized the National Assessment of Educational Progress 
(NAEP) and added a new dimension to the program: voluntary state-by-state assessments on a 
trial basis in 1990 and 1992, in addition to continuing the national assessments that NAEP had 
conducted since its inception. In this report, we will refer to the voluntary state-by-state 
assessment program as the Trial State Assessment Program. This program, which is designed to 
provide representative data on achievement for participating jurisdictions, is distinct from the 
assessment designed to provide nationally representative data, referred to in this report as the 
national assessment. (This terminology is also used in all other reports of the 1990, 1992, and 
1994 assessments.) It should be noted that the word trial in Trial State Assessment refers to the 
Congressionally mandated trial to determine whether such assessments can yield valid, reliable 
state representative data. All instruments and procedures used in the 1990, 1992, and 1994 Trial 
State and national assessments were previously piloted in field tests conducted in the year prior 
to each assessment. 

The 1990 Trial State Assessment Program collected information on the mathematics 
knowledge, skills, and understanding of a representative sample of eighth-grade students in 
public schools in 37 states, the District of Columbia, and two territories. The second phase of 
the Trial State Assessment Program, conducted in 1992, collected information on the 
mathematics knowledge, skills, and understanding of a representative sample of fourth- and 
eighth-grade students and the reading skills and understanding of a representative sample of 
fourth-grade students in public schools in 41 states, the District of Columbia, and two territories. 

The 1994 Trial State Assessment Program, described in this technical report, once again 
assessed the reading skills and understanding of representative samples of fourth-grade students 
in participating jurisdictions. The participation of jurisdictions in the Trial State Assessment has 
been, and continues to be, voluntary. The 1994 program broke new ground in two ways. First, the 
1994 NAEP authorization called for the assessment of samples of both public and private school 
students. Thus, for the first time in NAEP, jurisdiction-level samples of students from Catholic 
schools, other religious schools and private schools, Domestic Department of Defense Education 
Activity schools, and Bureau of Indian Affairs schools were added to the Trial State program. 
Second, samples of students from the Department of Defense Education Activity overseas 
schools participated as a jurisdiction, along with the states and territories that have traditionally 
had the opportunity to participate in the Trial State Assessment Program. 

Table 1-1 lists the jurisdictions that participated in the 1994 Trial State Assessment 
Program. More than 120,000 students at grade 4 participated in the reading assessment in those 
jurisdictions. Students were administered the same assessment booklets that were used in 
NAEP’s 1994 national grade 4 reading assessment. 

The reading framework that guided both the 1994 Trial State Assessment and the 1994 
national assessment is the same framework used for the 1992 NAEP assessments. The 
framework was developed for NAEP through a consensus project of the Council of Chief State 
School Officers, funded by the National Assessment Governing Board. Hence, 1994 provides 
the first opportunity to report jurisdiction-level trend data for a NAEP reading instrument for 
those states and territories that participated in both the 1992 and 1994 Trial State Assessment 
programs. In addition, questionnaires completed by the students, their reading teachers, and 
principals or other school administrators provided an abundance of contextual data within which 
to interpret the reading results. 

The purpose of this report is to provide technical information about the 1994 Trial State 
Assessment in reading. It provides a description of the design for the Trial State Assessment 
and gives an overview of the steps involved in the implementation of the program from the 
planning stages through to the analysis and reporting of the data. The report describes in detail 
the development of the cognitive and background questions, the field procedures, the creation of 
the database and data products for analysis, and the methods and procedures used for sampling, 
analysis, and reporting. It does not provide the results of the assessment — rather, it provides 
information on how those results were derived. 

This report is one of several documents that provide technical information about the 
1994 Trial State Assessment. For those interested in performing their own analyses of the data, 
this report and the user guide for the secondary-use data should be used as primary sources of 
information about NAEP. Information for lay audiences is provided in the procedural 
appendices to the reading subject-area reports; theoretical information about the models and 
procedures used in NAEP can be found in the special NAEP-related issue of the Journal of 
Educational Statistics (Summer 1992/Volume 17, Number 2). 




Table 1-1 

Jurisdictions Participating in the 
1994 Trial State Assessment Program 

Jurisdictions 

Alabama                   Hawaii           Mississippi       Pennsylvania 
Arizona                   Idaho            Missouri          Rhode Island 
Arkansas                  Indiana          Montana*          South Carolina 
California                Iowa             Nebraska          Tennessee 
Colorado                  Kentucky         New Hampshire     Texas 
Connecticut               Louisiana        New Jersey        Utah 
Delaware                  Maine            New Mexico        Virginia 
DoDEA Overseas*           Maryland         New York          Washington* 
District of Columbia**    Massachusetts    North Carolina    West Virginia 
Florida                   Michigan         North Dakota      Wisconsin 
Georgia                   Minnesota                          Wyoming 
Guam 

* Washington, Montana, and DoDEA (Department of Defense Education Activity) overseas schools participated 
in the 1994 program but did not participate in the 1992 program. 

** The District of Columbia participated in the testing portion of the 1994 Trial State Assessment Program. 
However, in accordance with the legislation providing for participants to review and give permission for release of 
their results, the District of Columbia chose not to publish their results in the reports. 



Educational Testing Service (ETS) was the contractor for the 1994 NAEP programs, 
including the Trial State Assessment. ETS was responsible for overall management of the 
programs as well as for development of the overall design, the items and questionnaires, data 
analysis, and reporting. National Computer Systems (NCS) was a subcontractor to ETS on both 
the national and Trial State NAEP programs. NCS was responsible for printing, distribution, 
and receipt of all assessment materials, and for scanning and professional scoring. All aspects of 
sampling and of field operations for both the national and Trial State Assessments were the 
responsibility of Westat, Inc. The National Center for Education Statistics contracted directly 
with Westat for these services for the national assessment. Westat was a subcontractor to ETS 
in providing sampling and field operations services for the Trial State Assessment. 



This technical report provides information about the technical bases for a series of 
reports that have been prepared for the 1994 Trial State Assessment Program in reading, 
including: 

• A State Report for each participating jurisdiction that describes the reading 
proficiency of the fourth-grade public- and nonpublic-school students in that 
jurisdiction and relates their proficiency to contextual information about reading 
policies and instruction. 



• The report NAEP 1994 Reading: A First Look, which provides overall public- 
school results and results for major NAEP reporting subgroups for all of the 
jurisdictions that participated in the Trial State Assessment Program, as well 
as selected results from the 1994 national reading assessment. 

• The NAEP 1994 Reading Report Card for the Nation and the States, which provides 
both public- and nonpublic-school data for all of the jurisdictions that participated in 
the Trial State Assessment Program along with a more complete report of the results 
from the 1994 national reading assessment. 

• The Executive Summary of the NAEP 1994 Reading Report Card for the Nation and the 
States, providing the highlights of the Reading Report Card. 

• The Cross-State Data Compendium from the NAEP 1994 Reading Assessment, which 
includes jurisdiction-level results for all the demographic, instructional and 
experiential background variables included in the Reading Report Card and State 
Report. 

• Data Almanacs for each jurisdiction that contain a detailed breakdown of the reading 
proficiency data according to the responses to the student, teacher, and school 
questionnaires for the public-school, nonpublic-school, and combined populations as a 
whole and for important subgroups of the public-school population. There are six 
sections to each almanac: 

▲ The Distribution Data Section provides information about the percentages of 
students at or above the three composite-scale achievement levels (and below 
basic). For the composite scale and each reading scale, this almanac also 
provides selected percentiles for the public-school, nonpublic-school, and 
combined populations and for the standard demographic subgroups of the 
public-school population. 

▲ The Student Questionnaire Section provides a breakdown of the 
composite-scale proficiency data according to the students’ responses to questions in the 
three student questionnaires included in the assessment booklets. 

▲ The Teacher Questionnaire Section provides a breakdown of the 
composite-scale proficiency data according to the teachers’ responses to questions in the 
reading teacher questionnaire. 

▲ The School Questionnaire Section provides a breakdown of the composite-scale 
proficiency data according to the principals’ (or other administrators’) 
responses to questions in the school characteristics and policies questionnaire. 

▲ The Scale Section provides a breakdown of the proficiency data for the two 
reading scales (Reading for Literary Experience, Reading to Gain 
Information) according to selected items from the questionnaires. 

▲ The Reading Item Section provides the response data for each reading item in 
the assessment. 



Organization of the Technical Report 

This chapter provides a description of the design for the Trial State Assessment in 
reading and gives an overview of the steps involved in implementing the program from the 
planning stages to the analysis and reporting of the data. The chapter summarizes the major 
components of the program, with references to later chapters for more details. The 
organization of this chapter, and of the report, is as follows: 

• Section 1.2 provides an overview of the design of the 1994 Trial State Assessment 
Program in reading. 

• Section 1.3 summarizes the development of the reading objectives and the 
development and review of the items written to measure those objectives. Details 
are provided in Chapter 2. 

• Section 1.4 discusses the assignment of the cognitive questions to assessment 
booklets. An initial discussion is provided of the partially balanced incomplete block 
(PBIB) spiral design that was used to assign cognitive questions to assessment 
booklets and assessment booklets to individuals. A more complete description is 
provided in Chapter 2. 

• Section 1.5 outlines the sampling design used for the 1994 Trial State Assessment 
Program. A fuller description is provided in Chapter 3. 

• Section 1.6 summarizes Westat’s field administration procedures, including securing 
school cooperation, training administrators, administering the assessment, and 
conducting quality control. Further details appear in Chapter 4. 

• Section 1.7 describes the flow of the data from their receipt at National Computer 
Systems through data entry, professional scoring, and entry into the ETS/NAEP 
database for analysis, and the creation of data products for secondary users. 
Chapters 5 and 6 provide a detailed description of the process. 

• Section 1.8 provides an overview of the data obtained from the 1994 Trial State 
Assessment in reading. 

• Section 1.9 summarizes the procedures used to weight the assessment data and to 
obtain estimates of the sampling variability of subpopulation estimates. Chapter 7 
provides a full description of the weighting and variance estimation procedures. 

• Section 1.10 describes the initial analyses performed to verify the quality of the data 
in preparation for more refined analyses, with details given in Chapter 9. 




• Section 1.11 describes the item response theory subscales and the overall reading 
composite that were created for the primary analysis of the Trial State Assessment 
data. Further discussion of the theory and philosophy of the scaling technology 
appears in Chapter 8, with details of the scaling process in Chapter 9. 

• Section 1.12 provides an overview of the linking of the scaled results from the Trial 
State Assessment to those from the national reading assessment. Details of the 
linking process appear in Chapter 9. 

• Section 1.13 describes the reporting of the assessment results, with further details 
supplied in Chapter 10. 

• Appendices A through G include a list of the participants in the objectives and item 
development process, a summary of the participation rates, a list of the conditioning 
variables, the IRT parameters for the reading items, the reporting subgroups, 
composite and derived common background and reporting variables, a description of 
the process used to define achievement levels, and a description of analyses 
comparing the performance of monitored and unmonitored schools for the 
nonpublic-school samples. 



1.2 DESIGN OF THE TRIAL STATE ASSESSMENT IN READING 

The major aspects of the design for the Trial State Assessment in reading included the 
following: 

• Participation at the jurisdiction level was voluntary. 

• Students from public and nonpublic schools were assessed. Nonpublic schools 
included Catholic schools, other religious schools, private schools, Domestic 
Department of Defense Education Activity schools, and Bureau of Indian Affairs 
schools. Separate representative samples of public and nonpublic schools were 
selected in each participating jurisdiction and students were randomly sampled within 
schools. The size of a jurisdiction’s nonpublic-school samples was proportional to the 
percentage of students in that jurisdiction attending such schools. 

• The fourth-grade reading assessment used for the 1994 NAEP Trial State 
Assessment, and included in the 1994 national NAEP instrument, consisted of eight 
25-minute blocks of exercises. Six of these blocks were previously administered as 
part of the 1992 national and Trial State Assessments. Each block contained one 
reading passage and a combination of constructed-response and multiple-choice 
items. Passages selected for the assessment were drawn from texts that might be 
found and used by students in real, everyday reading. Entire stories, articles, or 
sections of textbooks were used, rather than excerpts or abridgments. The type of 
items (constructed-response or multiple-choice) was determined by the nature of 
the task. In addition, the constructed-response items were of two types: Short 
constructed-response items required students to respond to a question with a few 
words or a few sentences, while extended constructed-response items required students 
to respond to a question with a few paragraphs. Each student was given two of the 
eight blocks of items. 

• A complex form of matrix sampling called a partially balanced incomplete block 
(PBIB) spiraling design was used. With PBIB spiraling, students in an assessment 
session received different booklets, which provides for greater reading content 
coverage than would have been possible had every student been administered the 
identical set of items, without imposing an undue testing burden on the student. 

• Background questionnaires given to the students, the students’ reading teachers, and 
the principals or other administrators provided a variety of contextual information. 
The background questionnaires for the Trial State Assessment were identical to those 
used in the national fourth-grade assessment. 

• The assessment time for each student was approximately 63 minutes. Each assessed 
student was assigned a reading booklet that contained a 5-minute background 
questionnaire, followed by two of the eight 25-minute blocks of reading items, a 5- 
minute reading background questionnaire, and a 3-minute motivation questionnaire. 
Sixteen different booklets were assembled. 

• The assessments took place in the five-week period between January 31 and March 
4, 1994. One-fourth of the schools in each state were assessed each week throughout 
the first four weeks; the fifth week was reserved for makeup sessions. 

• Data collection was, by law, the responsibility of each participating jurisdiction. 
Security and uniform assessment administration were high priorities. Extensive 
training was conducted to assure that the assessment would be administered under 
standard, uniform procedures. For jurisdictions that had participated in the 1992 
Trial State Assessment, 25 percent of the public-school assessment sessions and 50 
percent of the nonpublic-school assessment sessions were monitored by the 
contractor’s staff. For the remaining jurisdictions, 50 percent of both public- and 
nonpublic-school sessions were monitored. 
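
The spiraled booklet assignment described in the design list above can be illustrated with a short sketch. The booklet map used here is hypothetical: it pairs block i with blocks i+1 and i+3 (modulo 8) only to obtain sixteen two-block booklets from eight blocks, and it is not the booklet map actually used in the assessment (Chapter 2 describes that design). The function names `build_booklets` and `spiral` are likewise assumptions made for illustration.

```python
from collections import Counter
from itertools import cycle, islice

N_BLOCKS = 8  # eight 25-minute blocks, as stated in the design list


def build_booklets(n_blocks=N_BLOCKS, offsets=(1, 3)):
    """Hypothetical booklet map: (first block, second block) pairs.

    Each block is paired with the blocks 1 and 3 positions ahead of it,
    which gives 2 * n_blocks booklets. Not every pair of blocks occurs,
    so the design is only partially balanced.
    """
    return [(i, (i + k) % n_blocks) for k in offsets for i in range(n_blocks)]


def spiral(n_booklets, n_students, start=0):
    """Deal booklet numbers to students in a fixed rotating order."""
    return list(islice(cycle(range(n_booklets)), start, start + n_students))


booklets = build_booklets()
session = spiral(len(booklets), n_students=30)

# How often each block appears across the booklets, and in which position.
block_use = Counter(block for pair in booklets for block in pair)
first_position_use = Counter(first for first, _ in booklets)

print(len(booklets))        # number of distinct booklets
print(dict(block_use))      # appearances of each block
print(session[:5])          # booklets handed to the first five students
```

In this sketch every block appears in four booklets, twice in each position, and dealing the booklets in rotation means that neighboring students in a session receive different booklets while each booklet is used about equally often.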



1.3 DEVELOPMENT OF READING OBJECTIVES, ITEMS, AND BACKGROUND 
QUESTIONS 

The 1994 Trial State Assessment and national NAEP program in reading were based on 
a reading framework¹ developed through a national consensus process, set forth by law, that 
calls for "active participation of teachers, curriculum specialists, subject matter specialists, local 
school administrators, parents, and members of the general public" (Public Law 100-297, Part C, 
1988). This same framework was used for the 1992 Trial State Assessment in reading. 



¹ Reading Framework for the 1992 National Assessment of Educational Progress (Washington, DC: National Assessment 
Governing Board, U.S. Department of Education, 1992). In addition, questionnaires completed by the students, their 
reading teacher, and principal or other school administrator provided an abundance of contextual data within which to 
interpret the reading results. 



The process of developing the framework was carried out in late 1989 and early 1990 
under the direction of the National Assessment Governing Board (NAGB), which is responsible 
for formulating policy for NAEP, including developing assessment objectives and test 
specifications. To prepare the 1992 reading framework, NAGB awarded a contract to the 
Council of Chief State School Officers (CCSSO). As the framework was being developed, the 
project staff continually sought guidance and reaction from a wide range of people in the fields 
of reading and assessment, from school teachers and administrators, and from state coordinators 
of reading and reading assessment. After thorough discussion and some amendment, the 
recommended framework was adopted by NAGB in March 1990. 

The 1992 and 1994 NAEP reading assessments measured three general types of text and 
purposes for reading, the first two of which were measured at the fourth grade: 

Reading for Literary Experience usually involves the reading of novels, short stories, 
poems, plays, and essays. In these reading situations, readers explore the human 
condition and consider relationships among events, emotions, and possibilities. In 
reading for literary experience, readers are guided by what and how an author might 
write in a specific genre and by their expectations of how the text will be organized. 

The readers’ orientation when reading for literary experience usually involves 
looking for how the author explores or uncovers experiences and engaging in 
vicarious experiences through the text. 

Reading to Gain Information usually involves the reading of articles in magazines 
and newspapers, chapters in textbooks, entries in encyclopedias and catalogues, and 
entire books on particular topics. The type of prose found in such texts has its own 
features. To understand it, readers need to be aware of those features. For 
example, depending upon what they are reading, readers need to know the rules of 
literary criticism, or historical sequences of cause and effect, or scientific taxonomies. 

In addition, readers read to gain information for different purposes — for example, 
to find specific pieces of information when preparing a research project, or to get 
some general information when glancing through a magazine article. These 
purposes call for different orientations to text from those in reading for a literary 
experience because readers are focused specifically on acquiring information. 

Reading to Perform a Task usually involves the reading of documents such as bus 
or train schedules; directions for games, repairs, and classroom or laboratory 
procedures; tax or insurance forms; recipes; voter registration materials; maps; 
referenda; consumer warranties; and office memos. When they read to perform 
tasks, readers must use their expectations of the purposes of the documents and the 
structure of documents to guide how they select, understand, and apply such 
information. The readers’ orientation in these tasks involves looking for specific 
information so as to do something. Readers need to be able to apply the 
information, not simply understand it, as is usually the case in reading to be 
informed. Furthermore, readers engaging in this type of reading are not likely to 
savor the style or thought in these texts, as they might in reading for literary 
experience. Reading to Perform a Task was not measured at grade 4. 



All items underwent extensive reviews by specialists in reading, measurement, and 
bias/sensitivity, as well as reviews by representatives from State Education Agencies. The items 
repeated from the 1992 NAEP assessment were originally field tested in 1991. Additional items 
for the 1994 assessment were field tested in 1993 on a representative group of approximately 
6,800 students across 27 jurisdictions; about 500 responses were obtained to each item in the 
field test. Based on field test results, items that had not been used previously in a NAEP 
assessment were revised or modified as necessary and then again reviewed for sensitivity, 
content, and editorial concerns. With the assistance of ETS/NAEP staff and outside reviewers, 
the Reading Item Development Committee selected the items to include in the 1994 assessment. 

Chapter 2 includes specific details about developing the objectives and items for the 
Trial State Assessment. 



1.4 ASSESSMENT INSTRUMENTS 

The assembly of cognitive items into booklets and their subsequent assignment to 
assessed students was determined by a PBIB design with spiraled administration. Details of this 
design, identical to the design used in 1992, are provided in Chapter 2. In addition to the 
student assessment booklets, three other instruments provided data relating to the 
assessment: a reading teacher questionnaire, a school characteristics and policies questionnaire, 
and an IEP/LEP student questionnaire. 

The student assessment booklets contained five sections and included both cognitive and 
noncognitive items. In addition to two 25-minute sections of cognitive questions, each booklet 
included two 5-minute sets of general and reading background questions designed to gather 
contextual information about students, their experiences in reading, and their attitudes toward 
the subject, and one 3-minute section of motivation questions designed to gather information 
about the students’ levels of motivation for taking the assessment. 
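
The booklet timing follows directly from the five sections just described. The sketch below simply adds up the stated section lengths; the variable names are assumptions made for illustration.

```python
# Sections of a student assessment booklet and their lengths in minutes,
# in the order given in the design description (Section 1.2).
BOOKLET_SECTIONS = [
    ("general background questionnaire", 5),
    ("cognitive block 1", 25),
    ("cognitive block 2", 25),
    ("reading background questionnaire", 5),
    ("motivation questionnaire", 3),
]

total_minutes = sum(minutes for _, minutes in BOOKLET_SECTIONS)
cognitive_minutes = sum(
    minutes for name, minutes in BOOKLET_SECTIONS if name.startswith("cognitive")
)

print(total_minutes)      # 63
print(cognitive_minutes)  # 50
```

The total matches the assessment time of approximately 63 minutes per student, of which 50 minutes are spent on cognitive questions.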

The teacher questionnaire was administered to the reading teachers of the fourth-grade 
students participating in the assessment. The questionnaire consisted of three sections and took 
approximately 20 minutes to complete. The first section focused on teachers’ general 
background and experience; the second, on teachers’ background related to reading; and the 
third, on classroom information about reading. 

The school characteristics and policies questionnaire was given to the principal or other 
administrator in each participating school and took about 15 minutes to complete. The 
questions asked about the principal’s background and experience, school policies, programs, 
facilities, and the demographic composition and background of the students and teachers. 

The IEP/LEP student questionnaire was completed by the teachers of those students who 
were selected to participate in the Trial State Assessment sample but who were determined by 
the school to be ineligible to be assessed because they either had an Individualized Education 
Plan (IEP) and were not mainstreamed at least 50 percent of the time, or were categorized as 
Limited English Proficient (LEP). Each questionnaire took approximately three minutes to 
complete and asked about the nature of the student’s exclusion and the special programs in 
which the student participated. 



1.5 THE SAMPLING DESIGN 



The target populations for the Trial State Assessment Program in reading consisted of 
fourth-grade students enrolled in public schools and nonpublic schools. The representative 
sample of public-school fourth graders assessed in the Trial State Assessment came from about 
100 schools in each jurisdiction, unless a jurisdiction had fewer than 100 schools with a fourth 
grade, in which case all or almost all schools were asked to participate. The nonpublic-school 
samples differed in size across the jurisdictions, with the number of schools selected 
proportional to the nonpublic-school enrollment within each jurisdiction. On average, about 15 
nonpublic schools were included for each jurisdiction. The school sample in each state was 
designed to produce aggregate estimates for the state and for selected subpopulations 
(depending upon the size and distribution of the various subpopulations within the state), and 
also to enable comparisons to be made, at the state level, between administration of assessment 
tasks with monitoring and without monitoring. The public schools were stratified by 
urbanization, percentage of Black and Hispanic students enrolled, and median household 
income. The nonpublic schools were stratified by type of control (Catholic, private/other 
religious, other nonpublic), metro status, and enrollment size per grade. 

In most states, up to 30 students were selected from each school, with the aim of 
providing an initial target sample size of approximately 3,000 public-school students per state. 
The student sample size of 30 for each school was chosen to ensure that at least 2,000 public- 
school students participated from each state allowing for school nonresponse, exclusion of 
students, inaccuracies in the measures of enrollment, and student absenteeism from the 
assessment. In states with fewer schools, larger numbers of students per school were often 
required to ensure target samples of roughly 3,000 students. In certain jurisdictions, all eligible 
fourth graders were targeted for assessment. 
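
The sample-size arithmetic in the preceding paragraphs can be sketched as follows. This is only an illustration of the stated targets (about 100 schools, up to 30 students per school, roughly 3,000 students initially, at least 2,000 assessed). The function names and the `retention_rate` parameter are assumptions introduced here; the report does not state an overall retention rate.

```python
def target_sample_size(n_schools: int, students_per_school: int) -> int:
    """Initial target sample: schools selected times students drawn per school."""
    return n_schools * students_per_school


def expected_participants(target: int, retention_rate: float) -> float:
    """Expected number of assessed students.

    retention_rate is a hypothetical overall fraction of the target sample
    that remains after school nonresponse, exclusion, and absenteeism.
    """
    return target * retention_rate


def students_needed_per_school(target: int, n_schools: int) -> int:
    """Students per school needed to reach the target, rounded up."""
    return -(-target // n_schools)  # ceiling division on integers


if __name__ == "__main__":
    target = target_sample_size(100, 30)
    print(target)                                  # initial target sample
    print(expected_participants(target, 2 / 3))    # lowest retention that still meets the floor
    print(students_needed_per_school(target, 60))  # a state with only 60 schools
```

The last call shows why states with fewer schools required larger within-school samples: with the same target, fewer schools means more students per school.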

Students within a school were sampled from lists of fourth-grade students. The decisions 
to exclude students from the assessment were made by school personnel, as in the national 
assessment, and were based on the same criteria for exclusion (described in section 1.4) used for 
the national assessment. Each excluded student was carefully accounted for to estimate the 
percentage of the state population deemed unassessable and the reasons for exclusion. 

Chapter 3 describes the various aspects of selecting the sample for the 1994 Trial State 
Assessment — the construction of the public- and nonpublic-school frames, the stratification 
processes, the updating of the school frames with new schools, the actual sample selection, and 
the sample selection for the field test. 



1.6 FIELD ADMINISTRATION 

The administration of the 1994 program and the 1993 field test required collaboration 
between staff in the participating states and schools and the NAEP contractors, especially 
Westat, the field administration contractor. The purpose of the field test conducted in 1993 was 
to try out blocks of items that were to be used as replacements for the 1992 assessment blocks 
released to the public. 






Each jurisdiction volunteering to participate in the 1993 field test and in the 1994 Trial 
State Assessment was asked to appoint a state coordinator as liaison between NAEP staff and 
the participating schools. In addition, Westat hired and trained a supervisor for each state and 
six field managers, each of whom was assigned to work with groups of states. The state
supervisors were responsible for working with the state coordinators, overseeing assessment 
activities, training school district personnel to administer the assessment, and coordinating the 
quality-control monitoring efforts. Each field manager was responsible for working with the 
state coordinators of 7-8 states and the supervision of the state supervisors assigned to those 
states. An assessment administrator was responsible for preparing for and conducting the 
assessment session in one or more schools. These individuals were usually school or district 
staff and were trained by Westat. Westat also hired and trained three to five quality control 
monitors in each state. For states that had previously participated in the state assessment 
program, 25 percent of the public-school sessions and 50 percent of the nonpublic-school 
sessions were monitored. For states new to the program, 50 percent of all sessions were 
monitored. During the field test, the state supervisors monitored all sessions. 

Chapter 4 describes the procedures for obtaining cooperation from states and provides 
details about the field activities for both the field test and 1994 program. Chapter 4 also 
describes the planning and preparations for the actual administration of the assessment, the 
training and monitoring of the assessment sessions, and the responsibilities of the state 
coordinators, state supervisors, assessment administrators, and quality control monitors. 



1.7 MATERIALS PROCESSING AND DATABASE CREATION 

Upon completion of each assessment session, school personnel shipped the assessment 
booklets and forms to NAEP subcontractor National Computer Systems for professional scoring, 
entry into computer files, and checking. The files were then sent to Educational Testing Service 
for creation of the database. Careful checking assured that all data from the field were 
received. Chapter 5 describes the printing, distribution, receipt, processing, and final disposition 
of the 1994 Trial State Assessment materials. 

The volume of collected data and the complexity of the Trial State Assessment 
processing design, with its spiraled distribution of booklets, as well as the concurrent 
administration of this assessment and the national assessments, required the development and 
implementation of flexible, innovatively designed processing programs, and a sophisticated 
Process Control System. This system, described in Chapter 5, allowed an integration of data
entry and workflow management systems that included carefully planned and delineated editing, 
quality control, and auditing procedures. 

Chapter 5 also describes the data transcription and editing procedures used to generate 
the disk and tape files containing various assessment information, including the sampling weights 
required to make valid statistical inferences about the population from which the Trial State 
Assessment sample was drawn. Before any analysis could begin, the data from these files 
underwent a quality control check at ETS. The files were then merged into a comprehensive, 
integrated database. Chapter 6 describes the transcribed data files, the procedure of merging
them to create the Trial State Assessment database, the results of the quality control
process, and the procedures used to create data products for use in secondary research.



1.8 THE TRIAL STATE ASSESSMENT DATA



The basic information collected from the Trial State Assessment in reading consisted of 
the responses of the assessed students to 85 reading exercises organized around eight distinct 
reading passages. To limit the assessment time for each student to about one hour, a partially 
balanced incomplete block (PBIB) spiral design was used to assign a subset of the full exercise 
pool to each student. The partially balanced design differed slightly from the fully balanced 
incomplete block (BIB) spiral design used for the 1990 and 1992 Trial State Assessments in 
mathematics. Both the PBIB and BIB designs are variants of matrix sampling designs. 

The full set of reading items was divided into eight unique blocks, each requiring 25 
minutes for completion. Four of the blocks contained literary passages; the items accompanying 
these blocks were designed to assess student abilities in Reading for Literary Experience. The 
other four blocks were based on informational prose passages (e.g., magazine articles, newspaper 
articles, sections of textbook chapters, etc.); the items accompanying these passages were 
designed to assess student abilities in Reading to Gain Information. Each assessed student 
received a booklet containing two of the eight blocks according to a design that ensured that 
each block was administered to a representative sample of students within each jurisdiction. 

The design also ensured that each Reading for Literary Experience block was paired in exactly 
one booklet with every other Reading for Literary Experience block. Similarly, each Reading to 
Gain Information block was paired in exactly one booklet with every other Reading to Gain 
Information block. Furthermore, each Reading for Literary Experience block was paired in 
exactly one booklet with one of the Reading to Gain Information blocks. The data also included 
responses to the background questionnaires (described in section 1.4). Further details on the 
assembly of cognitive instruments and the data collection design can be found in Chapter 2. 
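
The pairing rules described above can be enumerated with a short sketch. The block labels are hypothetical, and two points are assumptions rather than statements from the report: that the cross-purpose pairings form a one-to-one matching between literary and informational blocks, and that the order of blocks within a booklet can be ignored for counting purposes. Chapter 2 gives the actual booklet design.

```python
from itertools import combinations

LITERARY = ["L1", "L2", "L3", "L4"]        # hypothetical block labels
INFORMATIONAL = ["I1", "I2", "I3", "I4"]   # hypothetical block labels


def build_booklets(literary, informational):
    """Enumerate unordered block pairs consistent with the stated pairing rules."""
    booklets = []
    # Every literary block meets every other literary block exactly once.
    booklets += list(combinations(literary, 2))
    # Every informational block meets every other informational block exactly once.
    booklets += list(combinations(informational, 2))
    # Each literary block meets exactly one informational block
    # (assumed here to be a one-to-one matching).
    booklets += list(zip(literary, informational))
    return booklets


def block_counts(booklets):
    """Number of booklets in which each block appears."""
    counts = {}
    for pair in booklets:
        for block in pair:
            counts[block] = counts.get(block, 0) + 1
    return counts


if __name__ == "__main__":
    booklets = build_booklets(LITERARY, INFORMATIONAL)
    print(len(booklets))
    print(block_counts(booklets))
```

Under these assumptions every block appears in the same number of booklets, which is what lets each block reach a comparable share of students when booklets are spiraled. The design is only partially balanced because not every cross-purpose pair of blocks occurs.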

The national data to which the Trial State Assessment results were compared were taken 
from nationally representative samples of public- and nonpublic-school students in the fourth 
grade. These samples were a part of the full 1994 national reading assessment, in which 
nationally representative samples of students in public and private schools from three age 
cohorts were assessed: students who were either in the fourth grade or 9 years old; students 
who were either in the eighth grade or 13 years old; and students who were either in the twelfth 
grade or 17 years old. 

The assessment instruments used in the Trial State Assessment were also used in the 
fourth-grade national assessments and were administered using the identical procedures in both 
assessments. The time of testing for the state assessments (January 31 to February 25, 1994) 
occurred within the time of testing of the national assessment (January 3 to April 1, 1994). The 
state assessments differed from the national assessment, however, in one important regard: 
Westat staff collected the data for the national assessment while, in accordance with the NAEP 
legislation, data collection activities for the Trial State Assessment were the responsibility of 
each participating jurisdiction. The data collection activities included ensuring the participation 
of selected schools and students, assessing students according to standardized procedures, and 
observing procedures for test security. 








1.9 WEIGHTING AND VARIANCE ESTIMATION 



A complex sample design was used to select the students to be assessed in each of the 
participating jurisdictions. The properties of a sample from a complex design are very different 
from those of a simple random sample in which every student in the target population has an 
equal chance of selection and in which the observations from different sampled students can be 
considered to be statistically independent of one another. The properties of the sample from 
the complex Trial State Assessment design were taken into account in the analysis of the 
assessment data. 

One way that the properties of the sample design were addressed was by using sampling 
weights to account for the fact that the probabilities of selection were not identical for all 
students. These weights included adjustments for school and student nonresponse. All 
population and subpopulation characteristics based on the Trial State Assessment data used 
sampling weights in their estimation. Chapter 7 provides details on the computation of these 
weights.
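
A minimal sketch of weighted estimation follows. It assumes the textbook construction in which a base weight is the reciprocal of the selection probability and a nonresponse adjustment inflates respondents' weights to cover nonrespondents in the same adjustment cell. The function names are illustrative, and the actual weighting steps are those documented in Chapter 7.

```python
def base_weight(selection_probability: float) -> float:
    """Base sampling weight: the reciprocal of the probability of selection."""
    return 1.0 / selection_probability


def nonresponse_factor(eligible_weight_sum: float, respondent_weight_sum: float) -> float:
    """Inflation factor so respondents also represent nonrespondents in their cell.

    This ratio form is a common convention and is assumed here.
    """
    return eligible_weight_sum / respondent_weight_sum


def weighted_mean(values, weights):
    """Weighted mean of values, used for population and subpopulation estimates."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


if __name__ == "__main__":
    # Two hypothetical students: the second was less likely to be selected,
    # so it carries more weight in the estimate.
    weights = [base_weight(0.5), base_weight(0.25)]
    print(weighted_mean([200.0, 220.0], weights))
```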

In addition to deriving appropriate estimates of population characteristics, it is essential 
to obtain appropriate measures of the degree of uncertainty of those statistics. One component 
of uncertainty is a result of sampling variability, which measures the dependence of the results
on the particular sample of students actually assessed. Because of the effects of cluster selection
(schools are selected first, then students are selected within those schools), observations made
on different students cannot be assumed to be independent of each other (and, in fact, are
generally positively correlated). As a result, classical variance estimation formulas will produce
incorrect results. Instead, a variance estimation procedure that takes the characteristics of the 
sample into account was used for all analyses. This procedure, called jackknife variance 
estimation, is discussed in Chapter 7. 
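
To illustrate the idea, the sketch below implements a generic delete-one-cluster jackknife for a weighted mean, treating each school as a cluster. This is a textbook form chosen for illustration, not necessarily the replicate scheme described in Chapter 7; the record layout and function names are assumptions.

```python
def weighted_mean(values, weights):
    """Weighted mean of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def jackknife_variance(records):
    """Delete-one-cluster jackknife variance of a weighted mean.

    records: list of (cluster_id, value, weight) tuples, where cluster_id
    identifies the school (or other sampled cluster) of each student.
    """
    clusters = sorted({cluster for cluster, _, _ in records})
    g = len(clusters)
    replicates = []
    for dropped in clusters:
        # Recompute the statistic with one whole cluster removed.
        kept = [(v, w) for cluster, v, w in records if cluster != dropped]
        replicates.append(weighted_mean([v for v, _ in kept], [w for _, w in kept]))
    center = sum(replicates) / g
    return (g - 1) / g * sum((r - center) ** 2 for r in replicates)


if __name__ == "__main__":
    records = [
        ("school_a", 210.0, 1.0), ("school_a", 214.0, 1.0),
        ("school_b", 190.0, 1.0), ("school_b", 194.0, 1.0),
        ("school_c", 230.0, 1.0), ("school_c", 226.0, 1.0),
    ]
    print(jackknife_variance(records))
```

Because whole clusters are dropped together, the positive correlation among students in the same school is reflected in the spread of the replicate estimates, which a formula that assumes independent observations would miss.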

Jackknife variance estimation provides a reasonable measure of uncertainty for any 
statistic based on values observed without error. Statistics such as the average proportion of 
students correctly answering a given question meet this requirement, but other statistics based 
on estimates of student reading proficiency, such as the average reading proficiency of a 
subpopulation, do not. Because each student typically responds to relatively few items within a 
particular purpose of reading (i.e., for literary experience or to gain information), there exists a 
nontrivial amount of imprecision in the measurement of the proficiency of a given student. This 
imprecision adds an additional component of variability to statistics based on estimates of 
individual proficiencies. The estimation of this component of variability is discussed in 
Chapter 8. 



1.10 PRELIMINARY DATA ANALYSIS

After the computer files of student responses were received from NCS, all cognitive and 
noncognitive items were subjected to an extensive item analysis. Each block of cognitive items 
was subjected to item analysis routines, which yielded for each item the number of respondents, 
the percentage of responses in each category (100 x item score), the percentage who omitted the 
item, the percentage who did not reach the item, and the correlation between the item score 
and the item block score (r-polyserial). In addition, the item analysis program provided
summary statistics for each block, including a reliability (internal consistency) coefficient. These
analyses were used to check on the scoring of the items, to verify the appropriateness of the 
difficulty level of the items, and to check for speededness. The results also were reviewed by 
knowledgeable project staff in search of aberrations that might signal unusual results or errors in 
the database. 
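
The kind of per-item summary described above can be sketched as follows. The response codes are hypothetical, the choice to keep omitted and not-reached responses in the denominator is an assumption, and a plain Pearson correlation is used as a simple stand-in for the r-polyserial statistic named in the text.

```python
import math

OMIT = "omit"                # hypothetical code for a skipped item
NOT_REACHED = "not_reached"  # hypothetical code for an item the student never got to


def category_percentages(responses):
    """Percentage of respondents in each response category.

    Omitted and not-reached responses are counted as categories of their own.
    """
    n = len(responses)
    counts = {}
    for r in responses:
        counts[r] = counts.get(r, 0) + 1
    return {category: 100.0 * count / n for category, count in counts.items()}


def pearson(xs, ys):
    """Pearson correlation between item scores and block scores."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    responses = [1, 0, 1, OMIT, 1, NOT_REACHED, 0, 1]
    print(len(responses), category_percentages(responses))
    print(pearson([0, 1, 1, 0], [2, 5, 6, 1]))
```

A low or negative item-block correlation, or an unusually high not-reached percentage near the end of a block, is the sort of aberration such a review would flag (a possible scoring error or speededness, respectively).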

Tables of the weighted percentages of students with responses in each category of each 
cognitive and background item were created and distributed to each state and jurisdiction. 
Additional analyses comparing the data from the monitored sessions with those from the 
unmonitored sessions were conducted to determine the comparability of the assessment data 
from the two types of administrations. Differential item functioning (DIF) analyses were carried 
out to identify items that were differentially difficult for various subgroups and to reexamine 
such items with respect to their fairness and their appropriateness for inclusion in the scaling 
process. Further details of the preliminary analyses appear in Chapter 9. 



1.11 SCALING THE ASSESSMENT ITEMS 

The primary analysis and reporting of the results from the Trial State Assessment used 
item response theory (IRT) scale-score models. Scaling models quantify a respondent’s 
tendency to provide correct answers to the domain of items contributing to a scale as a function 
of a parameter called proficiency. Proficiency can be viewed as a summary measure of 
performance across the domain of items that make up the scale. Three distinct IRT models 
were used for scaling: 1) 3-parameter logistic models for multiple-choice items; 2) 2-parameter 
logistic models for short constructed-response items that were scored correct or incorrect; and 
3) generalized partial credit models for short and extended constructed-response items that were 
scored on a multipoint scale. Chapter 8 provides an overview of the scaling models used. 

Further details on the application of these models are provided in Chapter 9. 
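
For orientation, the three model families named above can be written as short functions in their standard textbook forms. The scaling constant `D = 1.7` is a common convention assumed here, and the parameter names (`a`, `b`, `c`, `steps`) are illustrative; the parameterizations actually used in the analysis are those given in Chapters 8 and 9.

```python
import math

D = 1.7  # scaling constant; a common convention, assumed for this sketch


def p_3pl(theta, a, b, c):
    """3-parameter logistic: probability of a correct response.

    a: discrimination, b: difficulty, c: lower asymptote (guessing).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def p_2pl(theta, a, b):
    """2-parameter logistic: the 3PL with the lower asymptote fixed at zero."""
    return p_3pl(theta, a, b, 0.0)


def gpcm_probs(theta, a, steps):
    """Generalized partial credit model.

    steps: step difficulties for moving from score v-1 to score v (v = 1..m).
    Returns the probabilities of scores 0..m.
    """
    exponents = [0.0]
    for b_v in steps:
        exponents.append(exponents[-1] + D * a * (theta - b_v))
    top = max(exponents)  # subtract the maximum for numerical stability
    numerators = [math.exp(e - top) for e in exponents]
    total = sum(numerators)
    return [n / total for n in numerators]


if __name__ == "__main__":
    print(p_3pl(0.0, 1.0, 0.0, 0.2))
    print(p_2pl(0.0, 1.0, 0.0))
    print(gpcm_probs(0.3, 1.1, [-0.5, 0.4]))
```

With a single step, the generalized partial credit model reduces to the 2-parameter logistic model, which is one way to see the three models as members of one family.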

Two distinct scales were created for the Trial State Assessment to summarize fourth-
grade students’ reading abilities according to two purposes for reading: Reading for Literary
Experience and Reading to Gain Information. For reasons discussed in Chapter 9, these scales 
were defined identically to, but separately from, those used for the scaling of the national NAEP 
fourth-grade reading data. Although the items comprising each scale were identical to those 
used for the national program, the item parameters for the Trial State Assessment scales were 
estimated from combined data from all jurisdictions participating in the Trial State Assessment. 
Item parameter estimation was based on an item calibration sample consisting of an 
approximately 25 percent sample of all available data. To ensure equal representation in the 
scaling process, each jurisdiction was equally represented in the item calibration sample, as were 
monitored and unmonitored administrations from each jurisdiction. Chapter 9 provides further 
details about item parameter estimation. 

The fit of the IRT model to the observed data was examined within each scale by
comparing the estimates of the empirical item characteristic functions with the theoretical curves.
For binary-scored items, nonmodel-based estimates of the expected proportions of correct 
responses to each item for students with various levels of scale proficiency were compared with 
the fitted item response curve; for the short and extended partial-credit constructed-response 
items, the comparisons were based on the expected proportions of students with various levels of
scale proficiency who achieved each score level. In general, the item level results were well fit
by the scaling models. 

Using the item parameter estimates, estimates of various population statistics were 
obtained for each jurisdiction. The NAEP methods use random draws ("plausible values") from
estimated proficiency distributions for each student to compute population statistics. Plausible 
values are not optimal estimates of individual student proficiencies; instead, they serve as 
intermediate values to be used in estimating population characteristics. Under the assumptions 
of the scaling models, these population estimates will be consistent, in the sense that the 
estimates approach the model-based population values as the sample size increases, which would 
not be the case for subpopulation estimates obtained by aggregating optimal estimates of 
individual proficiency. Chapter 8 provides further details on the computation and use of 
plausible values. 
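
As a rough sketch of how plausible values are typically used, the function below computes a statistic once per set of plausible values and then combines the results. It assumes the standard multiple-imputation combining rule (sampling variance plus an inflated between-draw variance), which also captures the measurement-error component mentioned in section 1.9. The function name and the number of plausible values in the example are hypothetical; Chapter 8 documents the actual computations.

```python
from statistics import mean, variance


def combine_plausible_values(estimates, sampling_variances):
    """Combine per-plausible-value results into one estimate and total variance.

    estimates: the statistic (for example, a weighted mean) computed once
        with each set of plausible values.
    sampling_variances: the sampling variance (for example, from a jackknife)
        of each of those estimates.
    """
    m = len(estimates)
    point = mean(estimates)               # final estimate: average over draws
    within = mean(sampling_variances)     # sampling variability
    between = variance(estimates)         # variability due to measurement imprecision
    total = within + (1.0 + 1.0 / m) * between
    return point, total


if __name__ == "__main__":
    estimates = [210.0, 212.0, 214.0, 216.0, 218.0]   # hypothetical values
    sampling_variances = [1.0, 1.0, 1.0, 1.0, 1.0]    # hypothetical values
    print(combine_plausible_values(estimates, sampling_variances))
```

The between-draw term is what distinguishes statistics based on estimated proficiencies from statistics based on values observed without error: if every draw gave the same estimate, the total variance would reduce to the sampling variance alone.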

In addition to the plausible values for each scale, a composite of the two reading scales 
was created as a measure of overall reading proficiency. This composite was a weighted average 
of the plausible values for the two reading scales, in which the weights were proportional to the
relative importance assigned to each purpose in the reading objectives. The definition of the 
composite for the Trial State Assessment program was identical to that used for the national 
fourth-grade reading assessment. More details about composite scores are given in Chapter 9.
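
The composite can be sketched as a weighted average applied to each pair of plausible values. The weights passed in the example are purely illustrative placeholders; the actual weights, set by the relative importance of each purpose in the reading objectives, are not given in this section.

```python
def composite(literary, informational, w_literary, w_informational):
    """Weighted average of the two purpose-for-reading scale values.

    Weights are normalized, so only their ratio matters.
    """
    total = w_literary + w_informational
    return (w_literary * literary + w_informational * informational) / total


def composite_plausible_values(literary_pvs, informational_pvs, w_literary, w_informational):
    """Apply the composite draw by draw to a student's plausible values."""
    return [
        composite(lit, info, w_literary, w_informational)
        for lit, info in zip(literary_pvs, informational_pvs)
    ]


if __name__ == "__main__":
    # Hypothetical plausible values and placeholder equal weights.
    print(composite_plausible_values([200.0, 204.0], [220.0, 216.0], 1.0, 1.0))
```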



1.12 LINKING THE TRIAL STATE RESULTS TO THE NATIONAL RESULTS 

A major purpose of the Trial State Assessment Program was to allow each participating 
jurisdiction to compare its 1994 results with the nation as a whole and with the region of the 
country in which that jurisdiction is located. For meaningful comparisons to be made between 
each of the Trial State Assessment jurisdictions and the relevant national sample, results from 
these two assessments had to be expressed in terms of a similar system of scale units. 

The results from the Trial State Assessment were linked to those from the national
assessment through linking functions determined by comparing the results for the aggregate of
all students assessed in the Trial State Assessment with the results for fourth-grade students in 
the State Aggregate Comparison (SAC) subsample of the national NAEP. The SAC subsample 
of the national NAEP is a representative sample of the population of all grade-eligible public- 
school students within the aggregate of 41 participating states and the District of Columbia 
(Guam and the Department of Defense Overseas Education Activity schools were not included 
in the aggregate). Specifically, the grade 4 SAC subsample consists of all fourth-grade students 
in public schools in the states and the District of Columbia who were assessed in the national 
cross-sectional reading assessment. 

A linear transformation within each scale was used to link the results of the Trial State 
Assessment to the national assessment. The adequacy of linear linking was evaluated by 
comparing, for each scale (Reading for Literary Experience and Reading to Gain Information),
the distribution of reading proficiency based on the aggregation of all assessed students at each 
grade from the participating states and the District of Columbia with the equivalent distribution 
based on the students in the SAC subsample. In the estimation of these distributions, the 
students were weighted to represent the target population of public-school students in the
specified grade in the aggregation of the states and the District of Columbia. If a linear linking
were adequate, the distribution for the aggregate of states and the District of Columbia and that 
for the SAC subsample would have, to a close approximation, the same shape in terms of the 
skewness, kurtosis, and higher moments of the distributions. The only differences in the 
distributions allowed by linear linking would be in the means and variances. Generally, this was 
found to be the case. 

Each reading scale was linked by matching the mean and standard deviation of the scale 
proficiencies across all students in the Trial State Assessment (excluding Guam and the 
Department of Defense Overseas Activity Schools) to the corresponding scale mean and 
standard deviation across all students in the SAC subsample. Further details of the linking are 
given in Chapter 9. 
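
Matching a mean and standard deviation determines a linear transformation, which can be sketched as below. The function names and all numeric values are hypothetical; the weighted mean and standard deviation helper uses a simple weight-normalized variance, and Chapter 9 describes the procedure actually applied.

```python
import math


def weighted_mean_sd(values, weights):
    """Weighted mean and standard deviation of a set of scale proficiencies."""
    total = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(var)


def linking_constants(source_mean, source_sd, target_mean, target_sd):
    """Slope and intercept that map the source scale onto the target scale."""
    slope = target_sd / source_sd
    intercept = target_mean - slope * source_mean
    return slope, intercept


def link(x, slope, intercept):
    """Apply the linear linking function to one scale value."""
    return slope * x + intercept


if __name__ == "__main__":
    state_values = [-1.0, 0.0, 1.0]    # hypothetical Trial State Assessment scale values
    state_weights = [1.0, 1.0, 1.0]
    s_mean, s_sd = weighted_mean_sd(state_values, state_weights)
    # Hypothetical mean and standard deviation for the SAC subsample.
    slope, intercept = linking_constants(s_mean, s_sd, 215.0, 40.0)
    linked = [link(v, slope, intercept) for v in state_values]
    print(weighted_mean_sd(linked, state_weights))
```

A linear map changes only location and spread, so it can align two distributions exactly only if they already share the same shape. That is why the adequacy check described above compares skewness, kurtosis, and higher moments before the linking is accepted.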



1.13 REPORTING THE TRIAL STATE ASSESSMENT RESULTS

Each jurisdiction in the Trial State Assessment received a summary report providing the 
state’s results with accompanying text and tables, national and regional comparisons, and (for
states that had participated in the 1992 state program) trend comparisons to the previous 
assessment. These reports were generated by a computerized report-generation system for 
which graphic designers, statisticians, data analysts, and report writers collaborated to develop 
shells of the reports in advance of the analysis. These prototype reports were provided to State 
Education Agency personnel for their reviews and comments. The results of the data analysis 
were then automatically incorporated into the reports, which displayed tables and graphs of the 
results and interpretations of those results, including indications of subpopulation comparisons 
of statistical and substantive significance. 

Each report contained state-level estimates of mean proficiencies, both for the state as a 
whole and for categories of the key reporting variables: gender, race/ethnicity, level of parental 
education, and type of location. Results were presented for each reading proficiency scale, for 
the overall reading composite, and by achievement levels. Results were also reported for a 
variety of other subpopulations based on variables derived from the student, teacher, and school 
questionnaires. Standard errors are included for all statistics. 

A second report, 1994 NAEP Reading: A First Look, was released in April of 1995 
(several months prior to the release of the state reports and the other documents described 
below), presenting selected national and state public-school findings for the composite reading 
proficiency scale. The report compared 1994 results to 1992 results and included findings with 
respect to the reading achievement levels established by the National Assessment Governing 
Board. 



A third report, the NAEP 1994 Reading Report Card for the Nation and the States, 
highlighted key assessment results for the nation and summarized results across the states and 
territories participating in the assessment. This report contained composite scale results 
(proficiency means, proportions at or above achievement levels, etc.) for the nation, for each of 
the four regions of the country, and for each jurisdiction in the Trial State Assessment, both 
overall, by the primary demographic reporting variables, and for both public and nonpublic 
schools. In addition, results were reported for each of the reading scales. 






The fourth report, entitled Cross-state Data Compendium for the NAEP 1994 Reading 
Assessment, contains state-by-state results for all variables reported on in the Report Card and 
State Report.

The fifth report is a six-section almanac. The first section, or "distribution" section, 
provides results for the achievement levels and percentiles. Three of the sections of the almanac 
(referred to as proficiency sections) present summary tables based on responses to each of the 
questionnaires (student, reading teacher, and school) administered as part of the Trial State 
Assessment. The fifth section of the almanac, the scale section, reports proficiency means and 
associated standard errors for the two purpose-for-reading scales. Results in this section are 
reported for the total group in each state, as well as for select subgroups of interest. The final 
section of the almanac, the "p-value" section, provides the total-group proportion for each 
response alternative for each cognitive item included in the assessment. 

The production of the state reports, Reading Report Card, Cross-State Data Compendium, 
and the almanacs required a large number of decisions about a variety of data analysis and 
statistical issues. Chapter 10 documents the major conventions and statistical procedures used 
in generating the reports and almanacs. The chapter describes the rules, based on effect size 
and school and student sample size considerations, that were used to establish whether a 
particular category contained data sufficient to report reliable results for a particular state. 
Chapter 10 also describes the multiple comparison and effect-size-based inferential rules that 
were used for evaluating the statistical and substantive significance of subpopulation 
comparisons. 

To provide information about the generalizability of the results, a variety of information 
about participation rates was reported for each jurisdiction. This information included school 
participation rates, both in terms of the initially selected samples of schools and in terms of the 
finally achieved samples, including replacement schools. The student participation rates, the 
rates of students excluded due to Limited English Proficiency (LEP) and Individualized 
Education Plan (IEP) status, and the estimated proportions of assessed students who are
classified as IEP or LEP were also reported for each state. These rates are described and
reported in Appendix B. 




Chapter 2 



DEVELOPING THE OBJECTIVES, COGNITIVE ITEMS, 
BACKGROUND QUESTIONS, AND ASSESSMENT INSTRUMENTS 



Jay R. Campbell and Patricia L. Donahue 
Educational Testing Service 



2.1 OVERVIEW 

The framework that was developed for the 1992 NAEP Trial State Assessment in 
reading also served as the framework for the 1994 assessment. As in all previous NAEP
assessments, the objectives in reading were developed through a broad-based consensus process. 
To prepare the framework and objectives, initially for the 1992 reading assessment, the National 
Assessment Governing Board (NAGB) contracted with the Council of Chief State School 
Officers (CCSSO). The development process involved a steering committee, a planning 
committee, and CCSSO project staff. Educators, scholars, and citizens, representative of many 
diverse constituencies and points of view, participated in the national consensus process to 
design objectives for the reading assessment. 

The instrument used in the 1994 reading assessment was composed of a combination of 
reading passages and questions from the 1992 assessment and a set of passages and questions 
newly developed for 1994. Those passages and questions carried over from the 1992 instrument 
comprised two-thirds of the 1994 instrument. The remaining third was made up of new passages 
and questions developed according to the same framework that was used for the 1992 
assessment. Maintaining two-thirds of the instrument across the two assessment years allowed 
for the reporting of trends in reading performance. At the same time, developing a new set of 
passages and questions made it possible to release one-third of the 1992 assessment for public 
use. 



In developing the new portion of the 1994 NAEP reading assessment, the same 
framework, objectives, and procedures used in 1992 were followed. After careful reviews of the 
objectives, reading materials were selected and questions were developed that were appropriate 
to the objectives. All questions underwent extensive reviews by specialists in reading, 
measurement, and bias/sensitivity, as well as reviews by state representatives. 

The objectives and question development effort were governed by four major 
considerations: 



• The objectives for the reading assessment had to be developed through a
consensus process, involving subject-matter experts, school administrators, 
teachers, and parents. 



• As outlined in the ETS proposal for the administration of the NAEP contract 
(ETS, 1992), the development of the items had to be guided by a Reading 
Instrument Development Panel and receive further review by state 
representatives and classroom teachers from across the country. In addition, the 
items had to be carefully reviewed for potential bias. 

• As described in the ETS Standards of Quality and Fairness (ETS, 1987), all 
materials developed at ETS had to be in compliance with specified procedures. 

• As per federal regulations, all NAEP cognitive and background items had to be
submitted to a federal clearance process. 

This chapter includes details about developing the objectives and items for the Trial 
State Assessment in reading. The chapter also describes the instruments: the student
assessment booklets, reading teacher questionnaire, school characteristics and policies
questionnaire, and IEP/LEP student questionnaire. Various committees worked on the
development of the framework, objectives, and items for the reading assessment. The list of 
committee members and consultants who participated in the 1994 development process is 
provided in Appendix A. 



2.2 FRAMEWORK AND ASSESSMENT DESIGN PRINCIPLES

The reading framework for the 1992 and 1994 assessments was developed according to 
guidelines established by the Steering Committee. These guidelines determined that the design 
of the framework be performance-oriented with a focus on reading processes. The framework 
would embody a broad view of reading that addressed the high levels of literacy needed for 
employability, personal development, and citizenship. Also, the framework would take into
account contemporary research on reading and literacy and would expand the range of 
assessment tools to include formats that more closely resembled desired classroom activities. 

The objectives development was guided by the consideration that the assessment should 
reflect many of the states’ curricular emphases and objectives in addition to what various 
scholars, practitioners, and interested citizens believed should be included in the curriculum. 
Accordingly, the committee gave attention to several frames of reference:

• The purpose of the NAEP reading assessment is to provide information about 
the progress and achievement of students in general rather than to test individual 
students’ ability. NAEP is designed to inform policy makers and the public about 
reading ability in the United States. Furthermore, NAEP state data can be used 
to inform states of their students’ relative strengths and weaknesses. 

• The term "reading literacy" should be used in the broad sense of knowing when to 
read, how to read, and how to reflect on what has been read. It represents a 
complex, interactive process that goes beyond basic or functional literacy. 






• The reading assessment should use authentic passages and tasks that are both 
broad and complete in their coverage of important reading behaviors so that the 
assessment tool will demonstrate a close link to desired classroom instruction and 
students’ reading experiences. 

• Every effort should be made to make the best use of available methodology and 
resources in driving assessment capabilities forward. 

• Every effort must be made in developing the assessment to represent a variety of 
opinions, perspectives, and emphases among professionals in universities, as well 
as in state and local school districts. 



2.3 FRAMEWORK DEVELOPMENT PROCESS

The National Assessment Governing Board is responsible for guiding NAEP, including 
the development of the reading assessment objectives and test specifications. Appointed by the 
Secretary of Education from lists of nominees proposed by the board itself in various statutory 
categories, the 24-member board is composed of state, local, and federal officials, as well as
educators and members of the public. 

NAGB began the development process for the 1992 reading objectives (that also served 
as objectives for the 1994 assessment) by conducting a widespread mail review of the objectives 
for the 1990 reading assessment and by holding a series of public hearings throughout the 
country. The contract for managing the remainder of the consensus process was awarded to the 
Council of Chief State School Officers. The development process included the following 
activities: 

• A Steering Committee consisting of members recommended by each of 15 
national organizations (see Appendix A) was established to provide guidance for 
the consensus process. The committee responded to the progress of the project 
and offered advice. Drafts of each version of the document were sent to 
members of the committee for review and reaction. 

• A Planning Committee (see Appendix A) was established to identify the 
objectives to be assessed in reading in 1992, and subsequently in 1994, and to 
prepare the framework document. The members of this committee consisted of 
experts in reading, including college professors, an academic dean, a classroom 
teacher, a school administrator, state-level assessment and reading specialists, and 
a representative of the business community. This committee met with the 
Steering Committee and as a separate group. A subgroup also met to develop 
item specifications. Between meetings, members of the committee provided 
information and reactions to drafts of the framework. 

• The project staff at the Council of Chief State School Officers met regularly with 
staff from the National Assessment Governing Board and the National Center for 
Education Statistics to discuss progress made by the Steering and Planning 
committees. 






During this development process, input and reactions were continually sought from a 
wide range of members of the reading field, experts in assessment, school administrators, and
state staff in reading assessment. In particular, the process was informed by innovative state 
assessment efforts and work being done by the Center for the Learning and Teaching of 
Literature (Langer, 1989, 1990). 



2.4 FRAMEWORK FOR THE ASSESSMENT 

The framework adopted for the 1992 reading assessment and used for developing new 
portions of the 1994 instrument is organized according to a four-by-three matrix of reading 
stances by reading purposes. These stances included:

• Initial Understanding, 

• Developing an Interpretation, 

• Personal Reflection and Response, and 

• Demonstrating a Critical Stance. 

These stances were assessed across three global purposes defined as:

• Reading for Literary Experience, 

• Reading to Gain Information, and 

• Reading to Perform a Task. 

Different types of texts were used to assess the various purposes for reading. Students’ 
reading abilities were evaluated in terms of a single purpose for each type of text. At grade 4, 
only reading for literary experience and reading to gain information were assessed, while all 
three global purposes were assessed at grades 8 and 12. Figures 2-1 and 2-2 describe the four 
reading stances and three reading purposes that guided the development of the 1992 and 1994 
Trial State Assessments in reading. The interactions among the aspects of reading assessed are 
presented in Figure 2-3. 



2.5 DISTRIBUTION OF ASSESSMENT ITEMS 

In recognition that the demands made of readers change as readers move from grade to 
grade, the Planning Committee recommended that the proportion of items related to each of 
the reading purposes vary according to grade level. The relative contribution of each reading 
purpose to the overall proficiency score is presented in Table 2-1. The weighting of each 
reading purpose scale changes from grade to grade to reflect the changing demands made of 
students as they mature. 
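
As an illustration of how the weights in Table 2-1 enter a composite, the following minimal sketch forms a weighted average of purpose-scale scores. The function name and the example scores are hypothetical and are used only for illustration; the sketch is not the scaling procedure actually used for NAEP.

```python
# Weights from Table 2-1 (share of the composite reading scale).
# Grade 4 has no "perform a task" scale, so that purpose is omitted.
WEIGHTS = {
    4: {"literary": 0.55, "information": 0.45},
    8: {"literary": 0.40, "information": 0.40, "task": 0.20},
    12: {"literary": 0.35, "information": 0.45, "task": 0.20},
}


def composite_score(grade, purpose_scores):
    """Weighted average of purpose-scale scores (hypothetical helper)."""
    weights = WEIGHTS[grade]
    if set(purpose_scores) != set(weights):
        raise ValueError("scores must cover exactly the purposes scaled at this grade")
    return sum(weights[p] * purpose_scores[p] for p in weights)


# Made-up purpose-scale scores for a grade 4 group, for illustration only.
example = composite_score(4, {"literary": 220.0, "information": 210.0})
```

Because the weights at each grade sum to 100 percent, the composite always lies between the lowest and highest purpose-scale scores supplied.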



Figure 2-1 

Description of Reading Stances 



Readers interact with text in various ways as they use background knowledge and
understanding of text to construct, extend, and examine meaning. The NAEP reading
assessment framework specified four reading stances to be assessed that represent
various interactions between readers and texts. These stances are not meant to
describe a hierarchy of skills or abilities. Rather, they are intended to describe
behaviors that readers at all developmental levels should exhibit.



Initial Understanding 

Initial understanding requires a broad, preliminary construction of an understanding
of the text. Questions testing this aspect ask the reader to provide an initial impression
or unreflected understanding of what was read. In the 1992 and 1994 NAEP reading
assessments, the first question following a passage was usually one testing initial
understanding.



Developing an Interpretation 

Developing an interpretation requires the reader to go beyond the initial impression
to develop a more complete understanding of what was read. Questions testing this
aspect require a more specific understanding of the text and involve linking information 
across parts of the text as well as focusing on specific information. 



Personal Reflection and Response 

Personal response requires the reader to connect knowledge from the text more
extensively with his or her own personal background knowledge and experience. The 
focus is on how the text relates to personal experience: questions on this aspect ask 
the readers to reflect and respond from a personal perspective. For the 1992 and 
1994 NAEP reading assessments, personal response questions were typically formatted 
as constructed-response items to allow for individual interpretations and varied 
responses. 



Demonstrating a Critical Stance 

Demonstrating a critical stance requires the reader to stand apart from the text,
consider it, and judge it objectively. Questions on this aspect require the reader to
perform a variety of tasks such as critical evaluation, comparing and contrasting,
application to practical tasks, and understanding the impact of such text features as
irony, humor, and organization. These questions focus on the reader as
interpreter/critic and require reflection and judgments.






Figure 2-2 

Description of Purposes for Reading 



Reading involves an interaction between a specific type of text or written material and 
a reader, who typically has a purpose for reading that is related to the type of text and 
the context of the reading situation. The 1992 and 1994 NAEP reading assessments 
presented three types of text to students representing each of three reading purposes: 
literary text for literary experience, informational text to gain information, and 
documents to perform a task. Students' reading abilities were evaluated in terms of a 
single purpose for each type of text. 



Reading for Literary Experience 

Reading for literary experience involves reading literary text to explore the human 
condition, to relate narrative events with personal experiences, and to consider the 
interplay in the selection among emotions, events, and possibilities. Students in the 
NAEP reading assessment were provided with a wide variety of literary text, such as 
short stories, poems, fables, historical fiction, science fiction, and mysteries. 



Reading to Gain Information

Reading to gain information involves reading informative passages in order to 
obtain some general or specific information. This often requires a more utilitarian 
approach to reading that requires the use of certain reading/thinking strategies 
different from those used for other purposes. In addition, reading to gain information 
often involves reading and interpreting adjunct aids such as charts, graphs, maps, and 
tables that provide supplemental or tangential data. Informational passages in the 
NAEP reading assessment included biographies, science articles, encyclopedia entries, 
primary and secondary historical accounts, and newspaper editorials.



Reading to Perform a Task 

Reading to perform a task involves reading various types of materials for the 
purpose of applying the information or directions in completing a specific task. The 
reader’s purpose for gaining meaning extends beyond understanding the text to 
include the accomplishment of a certain activity. Documents requiring students in the 
NAEP reading assessment to perform a task included directions for creating a time 
capsule, a bus schedule, a tax form, and instructions on how to write a letter to a 
senator. Reading to perform a task was assessed only at grades 8 and 12. 






Figure 2-3 

1992 and 1994 NAEP Framework — Aspects of Reading Literacy 





Reading Stances

• Initial Understanding: Requires the reader to provide an initial impression or
unreflected understanding of what was read.

• Developing an Interpretation: Requires the reader to go beyond the initial
impression to develop a more complete understanding of what was read.

• Personal Reflection and Response: Requires the reader to connect knowledge
from the text with his/her own personal background knowledge.

• Demonstrating a Critical Stance: Requires the reader to stand apart from the
text and consider it.

Purposes for Reading, with sample questions for each stance

Reading for Literary Experience

• Initial Understanding: What is the story/plot about? How would you
describe the main character?

• Developing an Interpretation: How did the plot develop? How did this
character change from the beginning to the end of the story?

• Personal Reflection and Response: How did this character change your idea of
____? Is this story similar to or different from your own experiences?

• Demonstrating a Critical Stance: Rewrite this story with ____ as a setting or
____ as a character. How does this author's use of ____ (irony,
personification, humor) contribute to ____?

Reading to Gain Information

• Initial Understanding: What does this article tell you about ____? What does
the author think about this topic?

• Developing an Interpretation: What caused this event? In what ways are these
ideas important to the topic or theme?

• Personal Reflection and Response: What current event does this remind you
of? Does this description fit what you know about ____? Why?

• Demonstrating a Critical Stance: How useful would this article be for ____?
Explain. What could be added to improve the author's argument?

Reading to Perform a Task

• Initial Understanding: What is this supposed to help you do? What time can
you get a non-stop flight to X?

• Developing an Interpretation: What will be the result of this step in the
directions? What must you do before this step?

• Personal Reflection and Response: In order to ____, what information would
you need to find that you don't know right now? Describe a situation where
you could leave out step X.

• Demonstrating a Critical Stance: Why is this information needed? What would
happen if you omitted this?





Table 2-1 

Weighting of the Reading Purpose Scales on the Composite Reading Scale 



                      Purposes for Reading
Grade   For Literary Experience   To Gain Information   To Perform a Task
4       55%                       45%                   (No Scale)
8       40%                       40%                   20%
12      35%                       45%                   20%



Table 2-2 

Percentage Distribution of Items by Reading Stance for All Grades 
as Specified by the Reading Framework 



Initial Understanding/         Personal Reflection   Demonstrating a
Developing an Interpretation   and Response          Critical Stance
33%                            33%                   33%







Readers use a range of cognitive abilities and assume various stances as they engage in 
various reading experiences. In the 1992 and 1994 NAEP reading assessments, four stances 
were assessed within each of the reading purposes. While reading, students form an initial 
understanding of the text and connect ideas within the text to generate interpretations. In 
addition, they extend and elaborate their understanding by responding to the text personally and 
critically and by relating ideas in the text to prior experiences or knowledge. In accordance with
development specifications, items were developed to fulfill the reading stance requirements. 
Table 2-2 shows the distribution of items by reading stances for all three grade levels, as 
specified in the NAEP Reading Framework. The distribution requirements for the exercise 
specifications combined the stances of initial understanding and developing an interpretation. 



2.6 DEVELOPING THE COGNITIVE ITEMS 

The development of cognitive items began with a careful selection of grade-appropriate 
passages for the assessment. Passages were selected from a pool of reading selections 
contributed by teachers from across the country. The framework stated that the assessment 
passages should represent authentic, naturally occurring reading material that students may 
encounter in or out of school. These passages were reproduced in test booklets as they had 
appeared in their original publications. Final passage selections were made by the Reading 
Instrument Development Panel. Lastly, in order to guide the development of items, passages 
were outlined or mapped to identify essential elements of the text. 

The Trial State Assessment included constructed-response (short and extended) and 
multiple-choice items. The decision to use a specific item type was based on a consideration of 
the most appropriate format for assessing the particular objective. Both types of constructed-response
items were designed to provide an in-depth view of students’ ability to read
thoughtfully and generate their own responses to reading. Short constructed-response questions 
(scored with either a 2- or 3- level scoring rubric) were used when students needed to respond 
in only one or two sentences in order to demonstrate full comprehension. Extended 
constructed-response questions (scored with a 4-level scoring rubric) were used when the task 
required more thoughtful consideration of the text and engagement in more complex reading 
processes. Multiple-choice items were used when a straightforward, single correct answer was 
all that was required. Guided by the NAEP reading framework, the Instrument Development 
Panel monitored the development of all three types of items to assess objectives in the
framework. For more information about item scoring, see Chapter 5. 

The Trial State Assessment at grade 4 included eight different 25-minute blocks, each
consisting of one reading passage and a set of multiple-choice and constructed-response items to 
assess students’ comprehension of the written material. Students were asked to respond to two 
25-minute blocks within one booklet. Four blocks assessed reading for literary experience and 
four assessed reading to gain information. Even though the number of items varied within each 
block, the amount of assessment time was the same for each block and each reading purpose. 



As with the 1992 instrument development effort, a detailed series of steps was used to 
create the new assessment items for 1994 that reflected the objectives. 



1. Item specifications and prototype items were provided in the 1992 and 1994 
Reading Framework. 

2. The Reading Instrument Development Panel provided guidance to NAEP staff 
about how the objectives could be measured given the realistic constraints of 
resources and the feasibility of measurement technology. The Panel made 
recommendations about priorities for the assessment and types of items to be 
developed. 

3. Passages were chosen for the assessment through an extensive selection process
that involved the input of teachers from across the country as well as the Reading 
Instrument Development Panel. 

4. Item writers from both inside and outside ETS were selected based on their 
knowledge about reading theory and practices and experience in creating items 
according to specifications. 

5. The items were reviewed and revised by NAEP/ETS staff and external test 
specialists. 

6. Passages and items were reviewed by grade-appropriate teachers across the 
country for developmental appropriateness. 

7. Representatives from the State Education Agencies met and reviewed all items 
and background questionnaires (see section 2.8 for a discussion of the 
background questionnaires). 

8. Language editing and sensitivity reviews were conducted according to ETS quality 
control procedures. 

9. Field test materials were prepared, including the materials necessary to secure 
clearance by the Office of Management and Budget. 

10. The field test was conducted in 23 states, the District of Columbia, and three 
territories. 

11. Representatives from State Education Agencies met and reviewed the field test 
results. 



12. Based on the field test analyses, new items for the 1994 assessment were revised,
modified, and re-edited, where necessary. The items once again underwent ETS
sensitivity review.

13. The Reading Instrument Development Panel selected the blocks to include in the
1994 assessment.






14. After a final review and check to ensure that each assessment booklet and each
block met the overall guidelines for the assessment, the booklets were typeset and
printed. In total, the items that appeared in the Trial State Assessment
underwent 86 separate reviews, including reviews by NAEP/ETS staff, external
reviewers, State Education Agency representatives, and federal officials.

The overall pool of items for the Trial State Assessment consisted of 84 items, including 
37 short constructed-response items, 8 extended constructed-response items, and 39 multiple- 
choice items. Table 2-3 provides the percentage of assessment time (based on field observations 
and basic assumptions made in item development) devoted to each reading stance within the 
two purposes for reading. 
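
The item counts quoted above can be cross-checked against the per-block counts reported in Table 2-4. The short tally below is only a consistency check on the published figures; the variable names are illustrative, and the split of constructed-response items into short and extended is taken from the text because Table 2-4 does not report it by block.

```python
# (multiple-choice, constructed-response) item counts per cognitive block,
# transcribed from Table 2-4.
BLOCK_ITEMS = {
    "R3": (6, 5), "R4": (5, 7), "R5": (7, 4), "R9": (3, 6),   # literary experience
    "R6": (5, 5), "R7": (4, 6), "R8": (3, 6), "R10": (6, 6),  # gain information
}

total_mc = sum(mc for mc, _ in BLOCK_ITEMS.values())
total_cr = sum(cr for _, cr in BLOCK_ITEMS.values())
total_items = total_mc + total_cr

# Totals stated in the text: 39 multiple-choice items, 37 short plus
# 8 extended constructed-response items, and 84 items overall.
STATED = {"multiple_choice": 39, "constructed_response": 37 + 8, "all": 84}
consistent = (total_mc, total_cr, total_items) == (
    STATED["multiple_choice"], STATED["constructed_response"], STATED["all"]
)
```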



Table 2-3 

Percentage of Assessment Time Devoted to the Reading Stances 
Within Each Purpose for Reading for the 1994 Reading Trial State Assessment 



                                   Initial Understanding/
                                   Developing an            Personal Reflection   Demonstrating a
                                   Interpretation           and Response          Critical Stance
Grade   Purpose for Reading        Target   Actual*         Target   Actual*      Target   Actual*
4       For Literary Experience    33%      45%             33%      22%          33%      33%
        To Gain Information        33%      52%             33%      27%          33%      20%
        Overall                    33%      49%             33%      25%          33%      27%

*Actual percentages are based on the classifications agreed upon by NAEP’s Instrument Development Panel.



2.7 STUDENT ASSESSMENT BOOKLETS 

Each student assessment booklet included two sections of cognitive reading items and 
three sections of background questions. The assembly of reading blocks into booklets and their 
subsequent assignment to sampled students was determined by a partially balanced incomplete 
block (PBIB) design with spiraled administration. 

The first step in implementing PBIB spiraling for the grade 4 reading assessment 
required constructing blocks of passages and items that required 25 minutes to complete. These 
blocks were then assembled into booklets containing two 5-minute background sections, one 3- 
minute motivation questionnaire, and two 25-minute blocks of reading passages and items 
according to a partially balanced incomplete block design. The overall assessment time for each 
student was approximately 63 minutes. 

At the fourth-grade level, the blocks measured two purposes for reading: reading for
literary experience and reading to gain information. The reading blocks were assigned to
booklets in such a way that every block within a given purpose for reading was paired with every
other block measuring the same purpose but was only paired with one block measuring the
other purpose for reading. Every block appears in four booklets, three times within booklets
measuring the same purpose and once in a booklet measuring both purposes. This is the
partially balanced part of the balanced incomplete block design.

The PBIB design for both the 1992 and 1994 national reading assessments (and also
for the trial state assessments) was focused, in that each block was paired with every other 
reading block assessing the same purpose for reading but not with all the blocks assessing the 
other purpose for reading. The focused-PBIB design also balances the order of presentation of
the blocks of items: every block appears as the first cognitive block in two booklets and as the
second cognitive block in two other booklets. This design allows for some control of context 
effects (see Chapter 9). 
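
The pairing and ordering properties described above can be checked directly against the booklet layout given in Table 2-5. The sketch below transcribes the first and second cognitive blocks of the 16 booklets and tests each property in turn; the variable and function names are illustrative only.

```python
from collections import Counter
from itertools import combinations

# Cognitive blocks by purpose for reading (Table 2-4).
LITERARY = {"R3", "R4", "R5", "R9"}
INFORMATION = {"R6", "R7", "R8", "R10"}
ALL_BLOCKS = LITERARY | INFORMATION

# (first block, second block) for the 16 booklets, in the order of Table 2-5.
BOOKLETS = [
    ("R4", "R3"), ("R3", "R5"), ("R5", "R9"), ("R9", "R4"),
    ("R4", "R5"), ("R3", "R9"), ("R6", "R10"), ("R10", "R7"),
    ("R7", "R8"), ("R8", "R6"), ("R6", "R7"), ("R10", "R8"),
    ("R7", "R4"), ("R8", "R3"), ("R5", "R6"), ("R9", "R10"),
]


def purpose(block):
    return "literary" if block in LITERARY else "information"


first = Counter(a for a, _ in BOOKLETS)
second = Counter(b for _, b in BOOKLETS)
pairs = Counter(frozenset(b) for b in BOOKLETS)

# Every pair of blocks measuring the same purpose shares exactly one booklet.
same_purpose_ok = all(
    pairs[frozenset(p)] == 1
    for group in (LITERARY, INFORMATION)
    for p in combinations(sorted(group), 2)
)

# Each block is paired with exactly one block of the other purpose.
cross = Counter()
for a, b in BOOKLETS:
    if purpose(a) != purpose(b):
        cross[a] += 1
        cross[b] += 1
cross_ok = all(cross[blk] == 1 for blk in ALL_BLOCKS)

# Each block is the first cognitive block twice and the second twice.
order_ok = all(first[blk] == 2 and second[blk] == 2 for blk in ALL_BLOCKS)
```

All three checks hold for the layout as transcribed, which also implies that every block appears in four booklets.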

The design required that eight blocks of grade 4 reading items be assembled into sixteen
booklets. The assessment booklets were then spiraled and bundled. Spiraling involves 
interweaving the booklets in a systematic sequence so that each booklet appears an appropriate 
number of times in the sample. The bundles were designed so that each booklet would appear 
equally often in each position in a bundle. 

The final step in the PBIB-spiraling procedure was the assigning of the booklets to the
assessed students. The students within an assessment session were assigned booklets in the 
order in which the booklets were bundled. Thus, students in an assessment session received 
different booklets, and only a few students in a session received the same booklet. Across all 
jurisdictions in the Trial State Assessment, representative and randomly equivalent samples of 
about 25,625 students responded to each item. 
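
A rough sketch of the spiraling and assignment steps follows. The report does not specify the bundle size or the exact rotation rule, so the bundle size of 30, the session of 25 students, and the one-step rotation between bundles are all assumptions made purely for illustration; only the booklet numbers (30 through 45, as listed in Table 2-4) come from the report.

```python
from collections import Counter

BOOKLET_IDS = list(range(30, 46))  # booklets 30-45, as numbered in Table 2-4


def make_bundles(n_bundles, bundle_size, booklets=BOOKLET_IDS):
    """Build spiraled bundles; bundle k starts one booklet later than bundle k-1.

    Hypothetical rotation scheme, chosen only to illustrate the idea.
    """
    n = len(booklets)
    return [
        [booklets[(k + j) % n] for j in range(bundle_size)]
        for k in range(n_bundles)
    ]


def assign(students, bundle):
    """Hand out booklets to a session's students in bundle order."""
    return dict(zip(students, bundle))


# Assumed sizes: 16 bundles of 30 booklets, one session of 25 students.
bundles = make_bundles(n_bundles=16, bundle_size=30)
session = assign([f"student_{i:02d}" for i in range(25)], bundles[0])

# Across the 16 bundles, count how often each booklet fills each position.
position_counts = [Counter(b[j] for b in bundles) for j in range(30)]
# Within the session, count how many students share a booklet.
session_counts = Counter(session.values())
```

Under these assumptions each booklet fills each bundle position exactly once across the 16 bundles, and no booklet is held by more than two students in the illustrative session.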

Table 2-4 provides the composition of each block of items administered in the 
Trial State Assessment Program in reading. Table 2-5 shows the order of the blocks in each 
booklet and how the 8 cognitive blocks were arranged across the 16 booklets to achieve the PBIB- 
spiral design. The 1994 design was identical to that used in 1992. The two new blocks that were 
developed for the 1994 assessment at grade 4 (R8 and R9) were arranged within the booklet design in 
the same manner as were the 1992 blocks that they replaced. 



2.8 QUESTIONNAIRES 

As part of the Trial State Assessment (as well as the national assessment), a series of 
questionnaires was administered to students, teachers, and school administrators. Similar to the 
development of the cognitive items, the development of the policy issues and questionnaire items was 
a consensual process that involved staff work, field testing, and review by external advisory groups. 

A Background Questionnaire Panel drafted a set of policy issues and made recommendations 
regarding the design of the questions. They were particularly interested in capitalizing on the unique 
properties of NAEP and not duplicating other surveys (e.g., the National Survey of Public and Private 
School Teachers and Administrators, the School and Staffing Study, and the National Educational 
Longitudinal Study). 






Table 2-4 

Cognitive and Noncognitive Block Information 



                                          Total      Number of         Number of
                                          Number     Multiple-choice   Constructed-response   Booklets
Block   Type                              of Items   Items             Items                  Containing Block
B1      Common Background                 22         22                0                      30-45
R2      Reading Background                15         15                0                      30-45
RB      Reading Motivation                5          5                 0                      30-45
R3      Reading for Literary Experience   11         6                 5                      30, 31, 35, 43
R4      Reading for Literary Experience   12         5                 7                      30, 33, 34, 42
R5      Reading for Literary Experience   11         7                 4                      31, 32, 34, 44
R6      Reading to Gain Information       10         5                 5                      36, 39, 40, 44
R7      Reading to Gain Information       10         4                 6                      37, 38, 40, 42
R8*     Reading to Gain Information       9          3                 6                      38, 39, 41, 43
R9*     Reading for Literary Experience   9          3                 6                      32, 33, 35, 45
R10     Reading to Gain Information       12         6                 6                      36, 37, 41, 45

* New blocks for the 1994 assessment.



Table 2-5 
Booklet Contents 



Booklet   Common Background   Cognitive Blocks   Reading Background   Reading Motivation
Number    Block               1st      2nd       Block                Block
R1        B1                  R4       R3        R2                   RB
R2        B1                  R3       R5        R2                   RB
R3        B1                  R5       R9        R2                   RB
R4        B1                  R9       R4        R2                   RB
R5        B1                  R4       R5        R2                   RB
R6        B1                  R3       R9        R2                   RB
R7        B1                  R6       R10       R2                   RB
R8        B1                  R10      R7        R2                   RB
R9        B1                  R7       R8        R2                   RB
R10       B1                  R8       R6        R2                   RB
R11       B1                  R6       R7        R2                   RB
R12       B1                  R10      R8        R2                   RB
R13       B1                  R7       R4        R2                   RB
R14       B1                  R8       R3        R2                   RB
R15       B1                  R5       R6        R2                   RB
R16       B1                  R9       R10       R2                   RB






The Panel recommended a focused study that addressed the relationship between 
student achievement and instructional practices. The policy issues, items, and field test results 
were reviewed by the group of external consultants who identified specific items to be included 
in the final questionnaires. In addition, the Reading Instrument Development Panel and state 
representatives were consulted on the appropriateness of issues addressed in the questionnaires 
as they relate to reading instruction and achievement. The items underwent internal ETS 
review procedures to ensure fairness and quality and were then assembled into questionnaires. 



2.8.1 Student Questionnaires 

In addition to the cognitive questions, the 1994 Trial State Assessment included three 
student questionnaires. Two of these were five-minute sets of general and reading background 
questions designed to gather contextual information about students, their instructional and 
recreational experiences in reading, and their attitudes toward reading. The third, a three- 
minute questionnaire, was given to students at the end of each booklet to determine students’ 
motivation in completing the assessment and their familiarity with assessment tasks. In order to 
ensure that all fourth-grade students understood the questions and had every opportunity to
respond to them, the three questionnaires were read aloud by administrators as students read 
along and responded in their booklets. 

The student demographics (common core) questionnaire (22 questions) included 
questions about race/ethnicity, language spoken in the home, mother’s and father’s level of 
education, reading materials in the home, homework, attendance, which parents live at home, 
and which parents work. This questionnaire was the first section in every booklet. In many 
cases the questions used were continued from prior assessments, so as to document changes in 
contextual factors that occur over time. 

Three categories of information were represented in the second five-minute section of 
reading background questions called the student reading questionnaire (14 questions): 

Time Spent Studying Reading: Students were asked to describe both the amount of 
instruction they received in reading and the time spent on reading homework. 

Instructional Practices: Students were asked to report their instructional experiences 
related to reading in the classroom, including group work, special projects, and 
writing in response to reading. In addition, they were asked about the instructional 
practices of their reading teachers and the extent to which the students themselves 
discussed what they read in class and demonstrated use of skills and strategies. 

Attitudes Towards Reading: Students were asked a series of questions about their 
attitudes and perceptions about reading, such as whether they enjoyed reading and 
whether they were good in reading. 

The student motivation questionnaire (5 questions) asked students to describe how hard 
they tried on the NAEP reading assessment, how difficult they found the assessment, how many 
questions they thought they got right, how important it was for them to do well, and how 
familiar they were with the assessment format. 



2.8.2 Teacher, School, and IEP/LEP Student Questionnaires



To supplement the information on instruction reported by students, the reading teachers
of the fourth graders participating in the Trial State Assessment were asked to complete a 
questionnaire about their instructional practices, teaching backgrounds, and characteristics. The 
teacher questionnaire contained two parts. The first part pertained to the teachers’ background 
and general training. The second part pertained to specific training in teaching reading and the 
procedures the teacher uses for each class containing an assessed student. 

The Teacher Questionnaire, Part I: Background and General Training (25 questions) 
included questions pertaining to gender, race/ethnicity, years of teaching experience, 
certification, degrees, major and minor fields of study, course work in education, course work in 
specific subject areas, amount of in-service training, extent of control over instructional issues, 
and availability of resources for their classroom. 

The Teacher Questionnaire, Part II: Training in Reading and Classroom Instructional 
Information (46 questions) included questions on the teacher’s exposure to various issues 
related to reading and teaching reading through pre- and in-service training, ability level of 
students in the class, whether students were assigned to the class by ability level, time on task, 
homework assignments, frequency of instructional activities used in class, methods of assessing 
student progress in reading, instructional emphasis given to the reading abilities covered in the 
assessment, and use of particular resources. 

A School Characteristics and Policies Questionnaire was given to the principal or other 
administrator of each school that participated in the trial state assessment program. This 
information provided an even broader picture of the instructional context for students’ reading 
achievement. This questionnaire (64 questions) included questions about background and 
characteristics of school principals, length of school day and year, school enrollment, 
absenteeism, drop-out rates, size and composition of teaching staff, policies about grouping 
students, curriculum, testing practices and uses, special priorities and school-wide programs, 
availability of resources, special services, community services, policies for parental involvement, 
and school-wide problems. 

The IEP/LEP Student Questionnaire was completed by the teachers of those students
who were selected to participate in the trial state assessment sample but who were determined
by the school to be ineligible to be assessed. In order to be excluded from the assessment,
students must either have had an Individualized Education Plan (IEP) and not have been
mainstreamed at least 50 percent of the time, or have been categorized as Limited English
Proficient (LEP). In addition, the school staff would have needed to determine that it was inappropriate to include
these students in the assessment. This questionnaire asked about the nature of the student’s 
exclusion and the special programs in which the student participated. 



2.9 DEVELOPMENT OF FINAL FORMS 

The field tests of new items for the 1994 assessment were conducted in February and 
March 1993 and involved 6,800 students in 233 schools in 23 states, the District of Columbia, 
and three territories. The intent of the field test was to try out the items and procedures and to 
give the states and the contractors practice and experience with the proposed materials and 
procedures. About 500 responses were obtained to each item in the field test. 

The field test data were collected, scored, and analyzed in preparation for meetings with 
the Reading Instrument Development Panel. Four objectives guided these reviews: to 
determine which items were most suitable for assessing reading comprehension in accordance 
with the framework; to determine the need for revisions of items that lacked clarity, or had 
ineffective item formats; to prioritize items to be included in the Trial State Assessment; and to 
determine appropriate timing for assessment items. Committee members, ETS test 
development staff, and NAEP/ETS staff reviewed the materials. Item analyses (which provided 
the mean percentage of correct responses, the r-biserial correlations, and the difficulty level for 
each item) were used as a guide in identifying and flagging for further review those test items 
that were not measuring the intended objective well. In addition, another meeting of 
representatives from state education agencies was convened to review the field test results. 

Once the committees had selected the items, all items were rechecked for content, 
measurement, and sensitivity concerns. The federal clearance process was initiated in June 1993 
with the submission of draft materials to NCES. The final package containing the final set of 
cognitive items assembled into blocks and questionnaires was submitted in August 1993. 
Throughout the clearance process, revisions were made in accordance with changes required by 
the government. Upon approval, the blocks (assembled into booklets) and questionnaires were 
ready for printing in preparation for the assessment. 







Chapter 3 

SAMPLE DESIGN AND SELECTION 



James L. Green, John Burke, and Keith F. Rust 
Westat, Inc. 



3.1 OVERVIEW 

For the 1994 Trial State Assessment in reading, a combined sample of approximately 
2,800 fourth-grade public- and nonpublic-school students was assessed in most jurisdictions. 
Each sample was designed to produce aggregate estimates as well as estimates for various 
subpopulations of interest with approximately equal precision for the participating jurisdictions. 
In most of the jurisdictions, about 2,500 students from approximately 100 public schools were 
assessed. The nonpublic-school sample sizes were more varied, usually from about 100 to 500 
students in up to 22 nonpublic schools. The tables in Appendix B provide more detailed 
information about participation rates for schools and students. 

The target population for the 1994 Trial State Assessment Program included students in 
public and nonpublic schools who were enrolled in the fourth grade at the time of assessment. 
The sampling frame included public and nonpublic schools having the relevant grade in each 
jurisdiction. The samples were selected based on a two-stage sample design: selection of schools 
within participating jurisdictions, and selection of students within schools. The first-stage 
samples of schools were selected with probability proportional to the fourth-grade enrollment in 
the schools. Special procedures were used for jurisdictions with many small schools, and for 
jurisdictions having small numbers of grade-eligible schools. 

The sampling frame for each jurisdiction was first stratified by urbanization status of the 
area in which the school was located. The urbanization classes were defined in terms of large or 
midsize central city, urban fringe of large or midsize city, large town, small town, and rural 
areas. Within urbanization strata, schools were further stratified explicitly on the basis of 
minority enrollment in those jurisdictions with substantial Black or Hispanic student population. 
Minority enrollment was defined as the total percent of Black and Hispanic students enrolled in 
a school. Within minority strata, schools were sorted by median household income of the ZIP 
Code area where the school was located. 

A systematic random sample of about 100 fourth-grade schools was drawn with 
probability proportional to the fourth-grade enrollment of the school from the stratified frame 
of schools within each jurisdiction. Each selected school provided a list of eligible enrolled 
students, from which a systematic sample of students was drawn. One session of 30 students 
was sampled within each school, except in Delaware, where as many as three sessions were 
sampled within a given school. The number of sessions (i.e., multiples of 30 students) selected 
in each Delaware school was proportional to the fourth-grade enrollment of the school. Overlap 
between the 1994 state and national samples was minimized. 
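
The systematic selection with probability proportional to enrollment described above can be illustrated with a minimal sketch. It is not the contractor's program: the function name, the random-start convention, and the example frame are assumptions made here for illustration, and the frame is assumed to be already sorted in stratified order.

```python
import random


def systematic_pps_sample(sizes, n, rng=None):
    """Pick n units systematically with probability proportional to size.

    sizes: measures of size (e.g., fourth-grade enrollment) listed in the
           sorted, stratified frame order (assumed input layout).
    Returns the indices of the selected units. A unit larger than the
    sampling interval can be hit more than once.
    """
    rng = rng or random.Random()
    total = sum(sizes)
    interval = total / n                  # sampling interval
    start = rng.random() * interval       # random start in [0, interval)
    targets = [start + k * interval for k in range(n)]

    selected = []
    cumulative = 0.0
    next_target = 0
    for index, size in enumerate(sizes):
        cumulative += size
        # Select this unit for every target that falls inside its range.
        while next_target < n and targets[next_target] < cumulative:
            selected.append(index)
            next_target += 1
    return selected


# Hypothetical frame: 20 schools with 10 fourth graders each, sample of 5.
print(systematic_pps_sample([10] * 20, 5, random.Random(0)))
```

Because the selection walks down the sorted frame at a fixed interval, the sort order acts as an implicit stratification: the sample is spread across the frame roughly in proportion to enrollment.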

For jurisdictions that had participated in the 1992 Trial State Assessment, 25 percent of 
their selected public schools were designated at random to be monitored during the assessment 
so that reliable comparisons could be made between sessions administered with and without 
monitoring. For jurisdictions that had not participated in the previous assessment, 50 percent of 
their selected public schools were designated to be monitored. Fifty percent of all nonpublic 
schools were designated to be monitored, regardless of whether or not the jurisdiction had 
previously participated. 

The 1994 assessment was preceded in 1993 by a field test. The principal goals of the 
field test were to test procedures and new items contemplated for 1994. Furthermore, three 
states and one territory used the field test to observe and react to proposed strategies. 
Twenty-four states participated in the field test. Schools that participated in the field test were given a 
chance of selection in the 1994 assessment. Section 3.2 documents the procedures used to select 
the schools for the field test. 

Section 3.3 describes the construction of the sampling frames, including the sources of 
school data, missing data problems, and definition of in-scope schools. Section 3.4 includes a 
description of the various steps in stratification of schools within participating jurisdictions. 
School sample selection procedures (including new and substitute schools) are described in 
section 3.5. Section 3.6 includes the steps involved in selection of students within participating 
schools. 



3.2 SAMPLE SELECTION FOR THE 1993 FIELD TEST 

The Trial State Assessment 1993 field test was conducted together with the field test for 
the national portion of the assessment. Twenty-four states participated in the field test, which 
was conducted for grades 4, 8, and 12. Pairs of schools were identified, with one of each pair to 
be included in the test. This allowed state participation in the selection of the test schools and 
also facilitated replacement of schools that declined to participate in the assessment. Sampling 
weights were not computed for the field test samples. 



3.2.1 Primary Sampling Units 

The sampling frame for the field test primary sampling units (PSUs) was derived from 
the national frame of NAEP PSUs.¹ The 60 national frame PSUs that were noncertainty 
selections for the 1992 national NAEP were excluded from the field test sampling frame. 
National frame PSUs in the District of Columbia, Delaware, Hawaii, New Hampshire, Rhode 
Island, and Wyoming were excluded due to the heavy burden these states experience in the 
national and state assessments. National frame PSUs in Alaska were excluded to control field 
test costs. 

¹The frame of NAEP PSUs was the frame used to draw the national NAEP samples for 1986 to 1992. Refer to the 
1990 national technical report (Johnson & Allen, 1992) for more information. 

One hundred PSUs were selected from the resulting field test frame. Forty PSUs were 
selected with certainty and 60 noncertainty PSUs were selected, one per noncertainty stratum. 
The PSUs were selected systematically and with probability proportional to the 1980 PSU 
general population using a starting point that was selected to avoid overlap with PSUs selected 
for the national assessment studies from 1986 to 1992. 



3.2.2 Selection of Schools and Students 

Public, private, and Catholic schools with fourth-, eighth- or twelfth-grade students were 
in scope for the field test assessment. Schools with fewer than 25 fourth graders were 
eliminated from the frame to avoid the relatively high per student cost of conducting 
assessments in small schools. For the same reason, schools with fewer than 40 eighth or twelfth 
graders were eliminated from the frame. Schools that were selected in the 1992 national and 
state NAEP samples were eliminated from the frame to avoid undue burden. 

Three hundred pairs of schools were selected for each grade from the resulting frame by 
selecting two to eight pairs of schools within each of the 100 PSUs. In each of the 60 
noncertainty PSUs two pairs of schools were selected. In the 40 certainty PSUs, from two to 
eight pairs were selected in proportion to the size of the PSU. The first member of each pair 
was selected systematically and with probability proportional to grade enrollment. The 
twelfth-grade sample was drawn first, followed by the eighth- and fourth-grade samples. Each school 
selected for a grade was removed from the frame before the next grade’s schools were drawn. 
In this way, no school was selected for more than one grade. 

The second member of each pair was selected in such a way that the "distance" from the 
primary selection was the smallest across all schools in the sampling frame that were not 
selected for the fourth-, eighth- and twelfth-grade samples. The distance measure was a function 
of the percent of Black students, percent of Hispanic students, grade enrollment, and percent of 
students living below poverty. 
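
The report does not give the form of the distance function, so the following is only a sketch of how a nearest-match partner could be found. It assumes a Euclidean distance on standardized variables; that choice, the function names, and the example values are all hypothetical.

```python
import math


def standardize(rows):
    """Scale each column to mean 0 and standard deviation 1 (assumed scaling)."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by 0.
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def nearest_unselected(primary, features, taken):
    """Return the index of the closest school that is not already selected.

    features: one tuple per school, here (percent Black, percent Hispanic,
              grade enrollment, percent below poverty).
    taken:    indices already selected for any grade.
    """
    scaled = standardize(features)
    best, best_distance = None, math.inf
    for index, row in enumerate(scaled):
        if index == primary or index in taken:
            continue
        distance = math.dist(scaled[primary], row)
        if distance < best_distance:
            best, best_distance = index, distance
    return best


# Hypothetical frame of four schools; school 1 is already in another sample.
frame = [(10, 5, 100, 20), (12, 4, 95, 22), (60, 30, 400, 70), (11, 5, 102, 21)]
print(nearest_unselected(0, frame, taken={1}))
```

Standardizing first keeps enrollment, which is on a much larger scale than the percentages, from dominating the match.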



3.2.3 Assignment to Sessions for Different Subjects 

Six to eight different session types were assigned in a given state. The particular number 
of session types varied by grade and no individual school held more than three sessions. Table 
3-1 gives the overall number of sessions assigned by grade and session type. 








Table 3-1 

Number of Sessions by Grade and Session Type 

Session Type              Grade 4   Grade 8   Grade 12
Reading                        44       106        103
Mathematics                    87        85         82
Mathematics Estimation         22        21         21
Science                       154       148        145
U.S. History                  124       190        168
Geography                     122       147        147
Advanced Mathematics            -        53         40
Advanced Science                -         -         60
Totals                        553       750        766



The number of sessions assigned to an individual school depended on the size of the 
school and was determined as follows: 

Grade 4:   3 sessions for the 50 largest schools 
           2 sessions for all other schools with 55 or more students 
           1 session for all other schools 

Grade 8:   2 sessions for the 90 smallest schools 
           3 sessions for all other schools 

Grade 12:  3 sessions for the 180 largest schools 
           2 sessions for all others. 
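
The assignment rules above can be written as a short function. This is a sketch only: the function name is hypothetical, and breaking ties in school size by frame order is an assumption, since the report does not say how ties were handled.

```python
def assign_sessions(grade, enrollments):
    """Return the number of field test sessions for each school, in input order.

    grade:       4, 8, or 12
    enrollments: grade enrollment of each selected school
    """
    count = len(enrollments)
    # Rank 0 is the largest school; sorted() is stable, so ties keep
    # their input order (an assumption made for this sketch).
    order = sorted(range(count), key=lambda i: -enrollments[i])
    rank = {school: position for position, school in enumerate(order)}

    sessions = []
    for school, size in enumerate(enrollments):
        if grade == 4:
            if rank[school] < 50:
                number = 3          # the 50 largest schools
            elif size >= 55:
                number = 2          # other schools with 55 or more students
            else:
                number = 1
        elif grade == 8:
            # The 90 smallest schools hold 2 sessions, all others 3.
            number = 2 if rank[school] >= count - 90 else 3
        elif grade == 12:
            # The 180 largest schools hold 3 sessions, all others 2.
            number = 3 if rank[school] < 180 else 2
        else:
            raise ValueError("grade must be 4, 8, or 12")
        sessions.append(number)
    return sessions


# Hypothetical grade 4 sample: 100 schools with 30 to 129 fourth graders.
result = assign_sessions(4, list(range(30, 130)))
print(result.count(3), result.count(2), result.count(1))
```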



3.3 SAMPLING FRAME FOR THE 1994 ASSESSMENT 

3.3.1 Choice of School Sampling Frame 

In order to draw the school samples for the 1994 Trial State Assessment, it was 
necessary to obtain a comprehensive list of public and nonpublic schools in each jurisdiction. 
For each school, useful information for stratification purposes, reliable information about grade 
span and enrollment, and accurate information for identifying the school to the state coordinator 
(district membership, name, address) were required. 

Based on experience with the 1992 Trial State Assessment and national assessments from 
1984 to 1992, the file made available by Quality Education Data, Inc. (QED) was selected as the 
sampling frame. The National Center for Education Statistics’ Common Core of Data (CCD) 
school file was used to check the completeness of the QED file. This approach was the same as 
that used to develop frames for the 1992 Trial State Assessment. 






The QED list covers all jurisdictions except Puerto Rico. The version of the QED file 
used was released in late 1992, in time for selection of the school sample in early 1993. The file 
was missing minority and urbanization data for a sizable minority of schools (due to the inability 
of QED to match these schools with the corresponding CCD file). Considerable efforts were 
undertaken to obtain these variables for all schools in jurisdictions where these variables were to 
be used for stratification. These efforts are described in the next section. 

Table 3-2 shows the distribution of fourth-grade schools and enrollment within schools as 
reported in the 1992 QED file. Enrollment for each grade was estimated as the total 
school enrollment divided by the number of grades in the school. 
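
As a small worked example of the enrollment estimate, the sketch below divides total enrollment by the number of grades in the school. The function name and the convention of coding kindergarten as grade 0 are assumptions made for illustration.

```python
def estimated_grade_enrollment(total_enrollment, low_grade, high_grade):
    """Estimate per-grade enrollment as total enrollment / number of grades.

    Grades are given as integers, with kindergarten coded as 0
    (a convention assumed for this sketch).
    """
    number_of_grades = high_grade - low_grade + 1
    return total_enrollment / number_of_grades


# A hypothetical K-5 school with 480 students spans six grades,
# so its estimated fourth-grade enrollment is 480 / 6.
print(estimated_grade_enrollment(480, 0, 5))
```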



3.3.2 Missing Stratification 

As stated earlier, the sampling frame for the 1994 Trial State Assessment was the most 
recent version of the QED file as of January 1993. The CCD file was used to extract 
information on urbanization (type of location) and minority enrollment in the cases where these 
variables were missing on the QED file. For public schools, missing values remaining in 
urbanization or minority enrollment data were imputed. 

Schools with missing values in urbanization data were assigned the urbanization of other 
school records within the same jurisdiction, county, and city when urbanization did not vary 
within the given city. Any schools still missing urbanization were assigned values from the CCD 
file, where possible, or were assigned the modal value of urbanization within their city. Any 
remaining missing values were assigned individually based on city and Census publications. 

Schools with missing values in minority enrollment data were assigned the average 
minority enrollment within their school district. Any schools still missing minority enrollment 
data were assigned values individually using ZIP codes and Census data. The minority data 
were extracted only for those schools in jurisdictions in which minority stratification was 
performed. 
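
The first step of the minority enrollment imputation, filling a missing value with the district average, might look like the sketch below. The record layout and the use of an unweighted average across schools are assumptions; the report does not say whether the average was weighted by enrollment. Schools in districts with no reported values are left missing, to be resolved individually as described above.

```python
from collections import defaultdict


def impute_minority(schools):
    """Fill missing minority percentages with the district average.

    schools: list of dicts with keys 'district' and 'minority_pct'
             (None when missing); a hypothetical record layout.
    Returns new records; the input list is not modified.
    """
    totals = defaultdict(lambda: [0.0, 0])  # district -> [sum, count]
    for school in schools:
        if school["minority_pct"] is not None:
            entry = totals[school["district"]]
            entry[0] += school["minority_pct"]
            entry[1] += 1

    filled = []
    for school in schools:
        record = dict(school)
        if record["minority_pct"] is None and record["district"] in totals:
            value_sum, value_count = totals[record["district"]]
            record["minority_pct"] = value_sum / value_count
        filled.append(record)
    return filled


frame = [
    {"district": "A", "minority_pct": 10.0},
    {"district": "A", "minority_pct": 30.0},
    {"district": "A", "minority_pct": None},
    {"district": "B", "minority_pct": None},
]
print([s["minority_pct"] for s in impute_minority(frame)])
```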

Metro status was assigned to each nonpublic school based on Census definitions as of 
December 31, 1992 and FIPS county code. The QED school type was used to assign Catholic 
school status to nonpublic schools. Values for metro status and Catholic school status were 
found for all schools in the frame. 

Median income was assigned to every school in the sampling frame by merging on ZIP 
code with a file from Donnelly Marketing Information Services. Any schools still missing 
median income were assigned the mean value of median income for the three-digit ZIP code 
prefix or county within which they were located. 



Table 3-2 

Distribution of Fourth-grade Schools and Enrollment as Reported in QED 1992 

                             Public Schools           Nonpublic Schools
                           Total        Total        Total        Total
Jurisdiction             Schools   Enrollment      Schools   Enrollment
Alabama                      770        59833          211         4639
Alaska                       352         9666           52          476
Arizona                      663        55734          200         3581
Arkansas                     542        35222          111         1864
California                  4727       422095         1962        45740
Colorado                     759        49556          183         3144
Connecticut                  565        39660          205         4500
Delaware                      53         7207           66         1762
District of Columbia         116         6234           41         1215
Florida                     1413       164279          740        17588
Georgia                     1015        98414          327         7506
Hawaii                       173        14891           82         2740
Idaho                        316        18995           55          671
Illinois                    2334       144400          936        25159
Indiana                     1152        75809          474         8575
Iowa                         781        38813          202         4474
Kansas                       822        36986          165         3193
Kentucky                     819        50820          201         5448
Louisiana                    789        64458          319        10883
Maine                        405        17438           78          904
Maryland                     771        56587          308         8236
Massachusetts               1031        68314          333         8438
Michigan                    1873       126676          749        16049
Minnesota                    844        62300          420         7791
Mississippi                  458        40967          138         3805
Missouri                    1092        66583          424        10027
Montana                      468        12885           62          615
Nebraska                     948        22132          171         3267
Nevada                       234        18140           34          700
New Hampshire                265        15156           68         1131
New Jersey                  1319        88171          567        15005
New Mexico                   384        26206          128         1809
New York                    2249       197261         1281        37925
North Carolina              1113        87415          232         5436
North Dakota                 343         9875           58          840
Ohio                        2016       139722          721        20593
Oklahoma                     953        49375           92         1794
Oregon                       753        40374          174         2395
Pennsylvania                1870       131024         1220        28292
Rhode Island                 179        11466           68         1704
South Carolina               553        50842          198         4058
South Dakota                 395        11245          103         1247
Tennessee                    926        69647          223         4973
Texas                       3124       282576          666        15363
Utah                         433        37681           31          582
Vermont                      251         7926           39          397
Virginia                    1034        83093          301         6072
Washington                  1034        71984          336         5289
West Virginia                593        24688           98         1160
Wisconsin                   1152        63161          747        13412
Wyoming                      233         8345           24          232
Total                      47457      3392327        16624       382699






3.3.3 In-scope Schools 



The target population for the 1994 fourth-grade Trial State Assessment in reading 
included students in regular public and nonpublic schools who were enrolled in the fourth grade. 
Nonpublic schools include parochial schools, private schools, Bureau of Indian Affairs schools 
and Domestic Department of Defense Education Activity schools. Special education schools 
were not included. 



3.4 STRATIFICATION WITHIN JURISDICTIONS 
3.4.1 Stratification Variables 

Selection of schools within participating jurisdictions involved two stages of explicit 
stratification and one stage of implicit stratification. The two explicit stages for public schools 
were urbanization and minority enrollment. The two explicit stages for nonpublic schools were 
metro status and school type. The final stage for both public and nonpublic schools was median 
income. 



3.4.2 Urbanization Classification 

The NCES "type of location" variable was used to stratify fourth-grade schools into seven 
different urbanization levels: 

Large Central City: a central city of a Metropolitan Statistical Area (MSA) with a 
population greater than or equal to 400,000, or a population density greater than or 
equal to 6,000 persons per square mile; 

Midsize Central City: a central city of an MSA but not designated as a large central 
city; 



Urban Fringe of Large Central City: a place within an MSA of a large central city 
and defined as urban by the U.S. Bureau of Census; 

Urban Fringe of Midsize Central City: a place within an MSA of a midsize central city 
and defined as urban by the U.S. Bureau of Census; 

Large Town: a place not within an MSA, but with a population greater than or 
equal to 25,000 but greater than 50,000 and defined as urban by the U.S. Bureau of 
Census; 

Small Town: a place not within an MSA, with a population less than 25,000, but 
greater than 2,499 and defined as urban by U.S. Bureau of Census; and 

Rural: a place with a population of less than 2,500 and defined as rural by the U.S. 
Bureau of the Census. 



The urbanization strata were created by collapsing type of location categories. The 
nature of the collapsing varied across jurisdictions and grades. Each urbanization stratum 
included a minimum of 10 percent of eligible students in the participating jurisdiction. Table 3-3 
provides the urbanization categories (created by collapsing type of location) used within each 
jurisdiction. 

3.4.3 Minority Classification 

The second stage of stratification was minority enrollment. Minority enrollment strata 
were formed within urbanization strata, based on percentages of Black and Hispanic students. 
The three cases that occur are described in the following paragraphs. 

Case 1: Urbanization strata with less than 10 percent Black students and less than 7 percent 
Hispanic students were not stratified by minority enrollment. 

Case 2: Urbanization strata with greater than or equal to 10 percent Black students 
or 7 percent Hispanic students, but not more than twenty percent of each, were 
stratified by ordering percent minority enrollment within the urbanization classes 
and dividing the schools into three groups with about equal numbers of students per 
minority group. 

Case 3: In urbanization strata with greater than 20 percent of both Black and 
Hispanic students, minority strata were formed with the objective of providing equal 
strata with emphasis on the minority group (Black or Hispanic) with the higher 
concentration. The stratification was performed as follows. The minority group with 
the higher percentage gave the primary stratification variable; the remaining group 
gave the secondary stratification variable. Within urbanization class, the schools 
were sorted based on the primary stratification variable and divided into two groups 
of schools containing approximately equal numbers of students. Within each of 
these two groups, the schools were sorted by the secondary stratification variable and 
subdivided into two subgroups of schools containing approximately equal numbers 
of students. As a result, within urbanization strata there were four minority groups 
(e.g., low Black/low Hispanic, low Black/high Hispanic, high Black/low Hispanic, 
and high Black/high Hispanic). 

The cutpoints in minority enrollment used to classify urbanization strata into these three 
cases were developed empirically. They ensure that there is good opportunity to stratify by race 
and ethnicity, without creating very small strata that would lead to sampling inefficiency. 
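
The three cases can be sketched in code as follows. Several details here are assumptions rather than the documented procedure: stratum percentages are computed as enrollment-weighted averages, Case 3 is read as both groups exceeding 20 percent with every other non-Case-1 stratum falling under Case 2, and schools are cut into groups at equal shares of cumulative enrollment using each school's midpoint. All names are hypothetical.

```python
def minority_case(pct_black, pct_hispanic):
    """Classify an urbanization stratum into Case 1, 2, or 3."""
    if pct_black < 10 and pct_hispanic < 7:
        return 1
    if pct_black > 20 and pct_hispanic > 20:
        return 3
    return 2


def split_by_students(schools, key, n_groups):
    """Sort schools by key and cut into groups with about equal enrollment."""
    ordered = sorted(schools, key=key)
    total = sum(s["enroll"] for s in ordered)
    groups = [[] for _ in range(n_groups)]
    if total == 0:
        return groups
    running = 0
    for school in ordered:
        midpoint = running + school["enroll"] / 2  # assumed cut rule
        index = min(int(midpoint * n_groups / total), n_groups - 1)
        groups[index].append(school)
        running += school["enroll"]
    return groups


def weighted_pct(schools, field):
    total = sum(s["enroll"] for s in schools)
    return sum(s[field] * s["enroll"] for s in schools) / total


def minority_strata(schools):
    """Return {stratum label: schools} for one urbanization stratum."""
    pct_black = weighted_pct(schools, "black")
    pct_hisp = weighted_pct(schools, "hisp")
    case = minority_case(pct_black, pct_hisp)
    if case == 1:
        return {"None": list(schools)}
    if case == 2:
        labels = ["Low Percent Minority", "Medium Percent Minority",
                  "High Percent Minority"]
        parts = split_by_students(schools, lambda s: s["black"] + s["hisp"], 3)
        return dict(zip(labels, parts))
    # Case 3: the group with the higher percentage is the primary variable.
    primary, secondary = (("black", "hisp") if pct_black >= pct_hisp
                          else ("hisp", "black"))
    strata = {}
    halves = split_by_students(schools, lambda s: s[primary], 2)
    for p_level, half in zip(("Low", "High"), halves):
        quarters = split_by_students(half, lambda s: s[secondary], 2)
        for s_level, quarter in zip(("Low", "High"), quarters):
            level = {primary: p_level, secondary: s_level}
            label = f"{level['black']} Black/{level['hisp']} Hispanic"
            strata[label] = quarter
    return strata
```

For example, a stratum whose schools average 35 percent Black and 35 percent Hispanic falls under Case 3 and yields the four Low/High combinations listed in the text.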

The minority groups were formed solely for the purpose of creating efficient 
stratification design at this stage of sampling. These classifications were not used directly in 
analysis and reporting of the data, but acted to reduce sampling errors for achievement-level 
estimates. Table 3-3 provides information on minority stratification for the participating 
jurisdictions. 






Table 3-3 

Distribution of the Selected Public Schools by Sampling Strata 

  Urbanization                                Minority                   Originally Selected Schools

Alabama
  Midsize Central City                        Low Percent Minority        10
  Midsize Central City                        Medium Percent Minority     10
  Midsize Central City                        High Percent Minority       10
  Urban Fringe of Midsize Central City        Low Percent Minority         8
  Urban Fringe of Midsize Central City        Medium Percent Minority      9
  Urban Fringe of Midsize Central City        High Percent Minority        8
  Large/Small Town                            Low Percent Minority         9
  Large/Small Town                            Medium Percent Minority      9
  Large/Small Town                            High Percent Minority        9
  Rural                                       Low Percent Minority         8
  Rural                                       Medium Percent Minority      9
  Rural                                       High Percent Minority        8
  Total                                                                  107

Arizona
  Large Central City                          Low Percent Minority         9
  Large Central City                          Medium Percent Minority      9
  Large Central City                          High Percent Minority        9
  Midsize Central City                        Low Percent Minority        11
  Midsize Central City                        Medium Percent Minority     11
  Midsize Central City                        High Percent Minority       11
  Urban Fringe of Large/Midsize Central City  Low Percent Minority         5
  Urban Fringe of Large/Midsize Central City  Medium Percent Minority      6
  Urban Fringe of Large/Midsize Central City  High Percent Minority        5
  Large/Small Town and Rural                  Low Percent Minority         9
  Large/Small Town and Rural                  Medium Percent Minority     10
  Large/Small Town and Rural                  High Percent Minority       10
  Total                                                                  105

Arkansas
  Midsize Central City + Urban Fringe         Low Percent Minority         9
  Midsize Central City + Urban Fringe         Medium Percent Minority     10
  Midsize Central City + Urban Fringe         High Percent Minority       10
  Large/Small Town                            Low Percent Minority        16
  Large/Small Town                            Medium Percent Minority     15
  Large/Small Town                            High Percent Minority       15
  Rural                                       Low Percent Minority        11
  Rural                                       Medium Percent Minority     10
  Rural                                       High Percent Minority       11
  Total                                                                  107



Table 3-3 (continued) 

Distribution of the Selected Public Schools by Sampling Strata 

  Urbanization                                Minority                   Originally Selected Schools

California
  Large Central City                          Low Percent Minority         7
  Large Central City                          Medium Percent Minority      8
  Large Central City                          High Percent Minority        7
  Midsize Central City                        Low Percent Minority         7
  Midsize Central City                        Medium Percent Minority      7
  Midsize Central City                        High Percent Minority        7
  Urban Fringe of Large Central City          Low Percent Minority        10
  Urban Fringe of Large Central City          Medium Percent Minority     11
  Urban Fringe of Large Central City          High Percent Minority       11
  Urban Fringe of Midsize Central City        Low Percent Minority         3
  Urban Fringe of Midsize Central City        Medium Percent Minority      4
  Urban Fringe of Midsize Central City        High Percent Minority        3
  Large/Small Town and Rural                  Low Percent Minority         7
  Large/Small Town and Rural                  Medium Percent Minority      7
  Large/Small Town and Rural                  High Percent Minority        7
  Total                                                                  106

Colorado
  Large/Midsize Central City                  Low Percent Minority        12
  Large/Midsize Central City                  Medium Percent Minority     11
  Large/Midsize Central City                  High Percent Minority       11
  Urban Fringe of Large/Midsize Central City  Low Percent Minority        13
  Urban Fringe of Large/Midsize Central City  Medium Percent Minority     13
  Urban Fringe of Large/Midsize Central City  High Percent Minority       12
  Large/Small Town                            Low Percent Minority         6
  Large/Small Town                            Medium Percent Minority      7
  Large/Small Town                            High Percent Minority        6
  Rural                                       Low Percent Minority         6
  Rural                                       Medium Percent Minority      5
  Rural                                       High Percent Minority        6
  Total                                                                  108

Connecticut
  Large Central City                          Low Black/Low Hispanic       4
  Large Central City                          Low Black/High Hispanic      5
  Large Central City                          High Black/Low Hispanic      4
  Large Central City                          High Black/High Hispanic     5
  Midsize Central City                        Low Percent Minority         6
  Midsize Central City                        Medium Percent Minority      8
  Midsize Central City                        High Percent Minority        6
  Urban Fringe of Large Central City          None                        17
  Urban Fringe of Midsize Central City        None                        13
  Large/Small Town and Rural                  None                        37
  Total                                                                  105








Table 3-3 (continued) 

Distribution of the Selected Public Schools by Sampling Strata 

  Urbanization                                Minority                   Originally Selected Schools

Delaware
  Midsize Central City                        Low Percent Minority         5
  Midsize Central City                        Medium Percent Minority      6
  Midsize Central City                        High Percent Minority        5
  Urban Fringe of Midsize Central City        Low Percent Minority         3
  Urban Fringe of Midsize Central City        Medium Percent Minority      1
  Urban Fringe of Midsize Central City        High Percent Minority        2
  Small Town                                  Low Percent Minority         4
  Small Town                                  Medium Percent Minority      2
  Small Town                                  High Percent Minority        1
  Rural                                       Low Percent Minority         9
  Rural                                       Medium Percent Minority      7
  Rural                                       High Percent Minority        8
  Total                                                                   53

District of Columbia
  Large Central City                          Low Percent Minority        39
  Large Central City                          Medium Percent Minority     38
  Large Central City                          High Percent Minority       39
  Total                                                                  116

Florida
  Large Central City                          Low Black/Low Hispanic       4
  Large Central City                          Low Black/High Hispanic      4
  Large Central City                          High Black/Low Hispanic      4
  Large Central City                          High Black/High Hispanic     4
  Midsize Central City                        Low Percent Minority        12
  Midsize Central City                        Medium Percent Minority     12
  Midsize Central City                        High Percent Minority       12
  Urban Fringe of Large/Midsize Central City  Low Percent Minority        12
  Urban Fringe of Large/Midsize Central City  Medium Percent Minority     11
  Urban Fringe of Large/Midsize Central City  High Percent Minority       11
  Large/Small Town and Rural                  Low Percent Minority         7
  Large/Small Town and Rural                  Medium Percent Minority      6
  Large/Small Town and Rural                  High Percent Minority        7
  Total                                                                  106

Georgia
  Large/Midsize Central City                  Low Percent Minority         8
  Large/Midsize Central City                  Medium Percent Minority      9
  Large/Midsize Central City                  High Percent Minority        8
  Urban Fringe of Large/Midsize Central City  Low Percent Minority        10
  Urban Fringe of Large/Midsize Central City  Medium Percent Minority      9
  Urban Fringe of Large/Midsize Central City  High Percent Minority       10
  Small Town                                  Low Percent Minority        11
  Small Town                                  Medium Percent Minority     10
  Small Town                                  High Percent Minority       10
  Rural                                       Low Percent Minority         6
  Rural                                       Medium Percent Minority      7
  Rural                                       High Percent Minority        7
  Total                                                                  105

Guam
  None                                        None                        21






Table 3-3 (continued) 

Distribution of the Selected Public Schools by Sampling Strata 

  Urbanization                                Minority                   Originally Selected Schools

Hawaii
  Midsize Central City                        None                        32
  Urban Fringe of Midsize Central City        None                        50
  Small Town and Rural                        None                        23
  Total                                                                  105

Idaho
  Midsize Central City and Urban Fringe       None                        21
  Large Town                                  None                        18
  Small Town                                  None                        34
  Rural                                       None                        36
  Total                                                                  109

Indiana
  Midsize Central City                        Low Percent Minority         9
  Midsize Central City                        Medium Percent Minority      9
  Midsize Central City                        High Percent Minority        9
  Urban Fringe of Large/Midsize Central City  Low Percent Minority         9
  Urban Fringe of Large/Midsize Central City  Medium Percent Minority      8
  Urban Fringe of Large/Midsize Central City  High Percent Minority        9
  Large/Small Town                            None                        32
  Rural                                       None                        21
  Total                                                                  106

Iowa
  Midsize Central City and Urban Fringe       None                        36
  Large/Small Town                            None                        36
  Rural                                       None                        37
  Total                                                                  109

Kentucky
  Midsize Central City and Urban Fringe       Low Percent Minority        11
  Midsize Central City and Urban Fringe       Medium Percent Minority     11
  Midsize Central City and Urban Fringe       High Percent Minority       11
  Large/Small Town                            None                        37
  Rural                                       None                        37
  Total                                                                  107

Louisiana
  Large Central City                          Low Percent Minority         3
  Large Central City                          Medium Percent Minority      4
  Large Central City                          High Percent Minority        4
  Midsize Central City                        Low Percent Minority         9
  Midsize Central City                        Medium Percent Minority      8
  Midsize Central City                        High Percent Minority        8
  Urban Fringe of Large/Midsize Central City  Low Percent Minority         6
  Urban Fringe of Large/Midsize Central City  Medium Percent Minority      6
  Urban Fringe of Large/Midsize Central City  High Percent Minority        6
  Large/Small Town                            Low Percent Minority        11
  Large/Small Town                            Medium Percent Minority     11
  Large/Small Town                            High Percent Minority       10
  Rural                                       Low Percent Minority         6
  Rural                                       Medium Percent Minority      7
  Rural                                       High Percent Minority        6
  Total                                                                  105






Table 3-3 (continued) 

Distribution of the Selected Public Schools by Sampling Strata 

  Urbanization                                Minority                   Originally Selected Schools

Maine
  Midsize Central City and Urban Fringe       None                        20
  Small Town                                  None                        56
  Rural                                       None                        39
  Total                                                                  115

Maryland
  Large/Midsize Central City                  Low Percent Minority         6
  Large/Midsize Central City                  Medium Percent Minority      6
  Large/Midsize Central City                  High Percent Minority        7
  Urban Fringe of Large/Midsize Central City  Low Percent Minority        20
  Urban Fringe of Large/Midsize Central City  Medium Percent Minority     21
  Urban Fringe of Large/Midsize Central City  High Percent Minority       19
  Large/Small Town and Rural                  Low Percent Minority         8
  Large/Small Town and Rural                  Medium Percent Minority      9
  Large/Small Town and Rural                  High Percent Minority        9
  Total                                                                  105

Massachusetts
  Large/Midsize Central City                  Low Percent Minority        11
  Large/Midsize Central City                  Medium Percent Minority     12
  Large/Midsize Central City                  High Percent Minority       11
  Urban Fringe of Large/Midsize Central City  None                        33
  Large/Small Town and Rural                  None                        38
  Total                                                                  105

Michigan
  Large/Midsize Central City                  Low Percent Minority         9
  Large/Midsize Central City                  Medium Percent Minority      8
  Large/Midsize Central City                  High Percent Minority        8
  Urban Fringe of Large Central City          None                        24
  Urban Fringe of Midsize Central City        Low Percent Minority         4
  Urban Fringe of Midsize Central City        Medium Percent Minority      4
  Urban Fringe of Midsize Central City        High Percent Minority        4
  Large/Small Town                            None                        29
  Rural                                       None                        16
  Total                                                                  106

Minnesota
  Large/Midsize Central City                  Low Percent Minority         5
  Large/Midsize Central City                  Medium Percent Minority      4
  Large/Midsize Central City                  High Percent Minority        5
  Urban Fringe of Large/Midsize Central City  None                        36
  Large/Small Town                            None                        25
  Rural                                       None                        32
  Total                                                                  107






Table 3-3 (continued) 

Distribution of the Selected Public Schools by Sampling Strata 

| Jurisdiction | Urbanization | Minority | Originally Selected Schools |
|---|---|---|---|
| Mississippi | Midsize Central City | Low Percent Minority | 5 |
| | Midsize Central City | Medium Percent Minority | 4 |
| | Midsize Central City | High Percent Minority | 5 |
| | Urban Fringe of Large/Midsize Central City | Low Percent Minority | 4 |
| | Urban Fringe of Large/Midsize Central City | Medium Percent Minority | 3 |
| | Urban Fringe of Large/Midsize Central City | High Percent Minority | 3 |
| | Large/Small Town | Low Percent Minority | 16 |
| | Large/Small Town | Medium Percent Minority | 17 |
| | Large/Small Town | High Percent Minority | 15 |
| | Rural | Low Percent Minority | 11 |
| | Rural | Medium Percent Minority | 11 |
| | Rural | High Percent Minority | 11 |
| | Total | | 105 |
| Missouri | Large/Midsize Central City | Low Percent Minority | 5 |
| | Large/Midsize Central City | Medium Percent Minority | 5 |
| | Large/Midsize Central City | High Percent Minority | 6 |
| | Urban Fringe of Large/Midsize Central City | Low Percent Minority | 12 |
| | Urban Fringe of Large/Midsize Central City | Medium Percent Minority | 12 |
| | Urban Fringe of Large/Midsize Central City | High Percent Minority | 11 |
| | Large/Small Town | None | 28 |
| | Rural | None | 30 |
| | Total | | 109 |
| Montana | Midsize Central City and Urban Fringe | None | 24 |
| | Large Town | None | 12 |
| | Small Town | None | 41 |
| | Rural | None | 58 |
| | Total | | 135 |
| Nebraska | Midsize Central City and Urban Fringe | Low Percent Minority | 15 |
| | Midsize Central City and Urban Fringe | Medium Percent Minority | 14 |
| | Midsize Central City and Urban Fringe | High Percent Minority | 14 |
| | Large/Small Town | None | 40 |
| | Rural | None | 61 |
| | Total | | 144 |
| New Hampshire | Midsize Central City and Urban Fringe | None | 26 |
| | Large/Small Town | None | 57 |
| | Rural | None | 26 |
| | Total | | 109 |







Table 3-3 (continued) 

Distribution of the Selected Public Schools by Sampling Strata 

| Jurisdiction | Urbanization | Minority | Originally Selected Schools |
|---|---|---|---|
| New Jersey | Large/Midsize Central City | Low Black/Low Hispanic | 6 |
| | Large/Midsize Central City | Low Black/High Hispanic | 5 |
| | Large/Midsize Central City | High Black/Low Hispanic | 5 |
| | Large/Midsize Central City | High Black/High Hispanic | 6 |
| | Urban Fringe of Large Central City | Low Percent Minority | 13 |
| | Urban Fringe of Large Central City | Medium Percent Minority | 13 |
| | Urban Fringe of Large Central City | High Percent Minority | 13 |
| | Urban Fringe of Midsize Central City | Low Percent Minority | 7 |
| | Urban Fringe of Midsize Central City | Medium Percent Minority | 6 |
| | Urban Fringe of Midsize Central City | High Percent Minority | 7 |
| | Large/Small Town and Rural | None | 25 |
| | Total | | 106 |
| New Mexico | Midsize Central City and Urban Fringe | Low Percent Minority | 14 |
| | Midsize Central City and Urban Fringe | Medium Percent Minority | 14 |
| | Midsize Central City and Urban Fringe | High Percent Minority | 14 |
| | Large Town | Low Percent Minority | 5 |
| | Large Town | Medium Percent Minority | 6 |
| | Large Town | High Percent Minority | 5 |
| | Small Town | Low Percent Minority | 11 |
| | Small Town | Medium Percent Minority | 11 |
| | Small Town | High Percent Minority | 11 |
| | Rural | Low Percent Minority | 5 |
| | Rural | Medium Percent Minority | 6 |
| | Rural | High Percent Minority | 6 |
| | Total | | 108 |
| New York | Large/Midsize Central City | Low Black/Low Hispanic | 12 |
| | Large/Midsize Central City | Low Black/High Hispanic | 11 |
| | Large/Midsize Central City | High Black/Low Hispanic | 12 |
| | Large/Midsize Central City | High Black/High Hispanic | 12 |
| | Urban Fringe of Large/Midsize Central City | Low Percent Minority | 10 |
| | Urban Fringe of Large/Midsize Central City | Medium Percent Minority | 10 |
| | Urban Fringe of Large/Midsize Central City | High Percent Minority | 9 |
| | Large/Small Town and Rural | None | 29 |
| | Total | | 105 |






Table 3-3 (continued) 

Distribution of the Selected Public Schools by Sampling Strata 

| Jurisdiction | Urbanization | Minority | Originally Selected Schools |
|---|---|---|---|
| North Carolina | Midsize Central City | Low Percent Minority | 10 |
| | Midsize Central City | Medium Percent Minority | 11 |
| | Midsize Central City | High Percent Minority | 10 |
| | Urban Fringe of Midsize Central City | Low Percent Minority | 3 |
| | Urban Fringe of Midsize Central City | Medium Percent Minority | 4 |
| | Urban Fringe of Midsize Central City | High Percent Minority | 3 |
| | Large/Small Town | Low Percent Minority | 11 |
| | Large/Small Town | Medium Percent Minority | 11 |
| | Large/Small Town | High Percent Minority | 11 |
| | Rural | Low Percent Minority | 11 |
| | Rural | Medium Percent Minority | 11 |
| | Rural | High Percent Minority | 10 |
| | Total | | 106 |
| North Dakota | Midsize Central City and Urban Fringe | None | 36 |
| | Large/Small Town | None | 30 |
| | Rural | None | 63 |
| | Total | | 129 |
| Pennsylvania | Large/Midsize Central City | Low Percent Minority | 7 |
| | Large/Midsize Central City | Medium Percent Minority | 7 |
| | Large/Midsize Central City | High Percent Minority | 7 |
| | Urban Fringe of Large/Midsize Central City | Low Percent Minority | 11 |
| | Urban Fringe of Large/Midsize Central City | Medium Percent Minority | 12 |
| | Urban Fringe of Large/Midsize Central City | High Percent Minority | 11 |
| | Large/Small Town | None | 34 |
| | Rural | None | 18 |
| | Total | | 107 |
| Rhode Island | Large Central City | Low Hispanic/Low Black | 5 |
| | Large Central City | Low Hispanic/High Black | 4 |
| | Large Central City | High Hispanic/Low Black | 5 |
| | Large Central City | High Hispanic/High Black | 5 |
| | Midsize Central City | Low Percent Minority | 4 |
| | Midsize Central City | Medium Percent Minority | 5 |
| | Midsize Central City | High Percent Minority | 4 |
| | Urban Fringe of Large/Midsize Central City | None | 45 |
| | Large/Small Town and Rural | None | 29 |
| | Total | | 106 |






Table 3-3 (continued) 

Distribution of the Selected Public Schools by Sampling Strata 

| Jurisdiction | Urbanization | Minority | Originally Selected Schools |
|---|---|---|---|
| South Carolina | Midsize Central City | Low Percent Minority | 5 |
| | Midsize Central City | Medium Percent Minority | 5 |
| | Midsize Central City | High Percent Minority | 6 |
| | Urban Fringe of Midsize Central City | Low Percent Minority | 9 |
| | Urban Fringe of Midsize Central City | Medium Percent Minority | 10 |
| | Urban Fringe of Midsize Central City | High Percent Minority | 9 |
| | Small Town | Low Percent Minority | 12 |
| | Small Town | Medium Percent Minority | 12 |
| | Small Town | High Percent Minority | 12 |
| | Rural | Low Percent Minority | 9 |
| | Rural | Medium Percent Minority | 8 |
| | Rural | High Percent Minority | 8 |
| | Total | | 105 |
| Tennessee | Large Central City | Low Percent Minority | 8 |
| | Large Central City | Medium Percent Minority | 8 |
| | Large Central City | High Percent Minority | 9 |
| | Midsize Central City | Low Percent Minority | 5 |
| | Midsize Central City | Medium Percent Minority | 5 |
| | Midsize Central City | High Percent Minority | 4 |
| | Urban Fringe of Large/Midsize Central City | Low Percent Minority | 5 |
| | Urban Fringe of Large/Midsize Central City | Medium Percent Minority | 5 |
| | Urban Fringe of Large/Midsize Central City | High Percent Minority | 5 |
| | Large/Small Town | Low Percent Minority | 10 |
| | Large/Small Town | Medium Percent Minority | 10 |
| | Large/Small Town | High Percent Minority | 10 |
| | Rural | None | 22 |
| | Total | | 106 |
| Texas | Large Central City | Low Hispanic/Low Black | 7 |
| | Large Central City | Low Hispanic/High Black | 7 |
| | Large Central City | High Hispanic/Low Black | 8 |
| | Large Central City | High Hispanic/High Black | 7 |
| | Midsize Central City | Low Percent Minority | 9 |
| | Midsize Central City | Medium Percent Minority | 9 |
| | Midsize Central City | High Percent Minority | 9 |
| | Urban Fringe of Large/Midsize Central City | Low Percent Minority | 5 |
| | Urban Fringe of Large/Midsize Central City | Medium Percent Minority | 4 |
| | Urban Fringe of Large/Midsize Central City | High Percent Minority | 5 |
| | Small Town | Low Percent Minority | 7 |
| | Small Town | Medium Percent Minority | 8 |
| | Small Town | High Percent Minority | 8 |
| | Rural | Low Percent Minority | 5 |
| | Rural | Medium Percent Minority | 4 |
| | Rural | High Percent Minority | 5 |
| | Total | | 107 |









Table 3-3 (continued) 

Distribution of the Selected Public Schools by Sampling Strata 

| Jurisdiction | Urbanization | Minority | Originally Selected Schools |
|---|---|---|---|
| Utah | Midsize Central City | None | 36 |
| | Urban Fringe of Midsize Central City | None | 32 |
| | Large/Small Town | None | 16 |
| | Rural | None | 22 |
| | Total | | 106 |
| Virginia | Midsize Central City | Low Percent Minority | 13 |
| | Midsize Central City | Medium Percent Minority | 12 |
| | Midsize Central City | High Percent Minority | 12 |
| | Urban Fringe of Large/Midsize Central City | Low Percent Minority | 10 |
| | Urban Fringe of Large/Midsize Central City | Medium Percent Minority | 9 |
| | Urban Fringe of Large/Midsize Central City | High Percent Minority | 10 |
| | Large/Small Town | Low Percent Minority | 5 |
| | Large/Small Town | Medium Percent Minority | 6 |
| | Large/Small Town | High Percent Minority | 5 |
| | Rural | Low Percent Minority | 8 |
| | Rural | Medium Percent Minority | 8 |
| | Rural | High Percent Minority | 8 |
| | Total | | 106 |
| Washington | Large/Midsize Central City | None | 35 |
| | Urban Fringe of Large/Midsize Central City | None | 31 |
| | Large/Small Town | Low Percent Minority | 7 |
| | Large/Small Town | Medium Percent Minority | 8 |
| | Large/Small Town | High Percent Minority | 7 |
| | Rural | None | 18 |
| | Total | | 106 |
| West Virginia | Midsize Central City | None | 16 |
| | Urban Fringe of Midsize Central City | None | 12 |
| | Large/Small Town | None | 32 |
| | Rural | None | 52 |
| | Total | | 112 |
| Wisconsin | Large Central City | Low Percent Minority | 5 |
| | Large Central City | Medium Percent Minority | 5 |
| | Large Central City | High Percent Minority | 5 |
| | Midsize Central City | None | 23 |
| | Urban Fringe of Large/Midsize Central City | None | 14 |
| | Large/Small Town | None | 29 |
| | Rural | None | 27 |
| | Total | | 108 |
| Wyoming | Midsize Central City | None | 13 |
| | Urban Fringe of Midsize Central City | Low Percent Minority | 6 |
| | Urban Fringe of Midsize Central City | Medium Percent Minority | 5 |
| | Urban Fringe of Midsize Central City | High Percent Minority | 5 |
| | Small Town | None | 61 |
| | Rural | None | 31 |
| | Total | | 121 |







3.4.4 Metro Status 



All schools in the sampling frame were assigned metro status based on their FIPS county 
code and Census Bureau Metropolitan Statistical Area Definitions as of December 31, 1992. 

The field indicated if the school was located within a metropolitan area or not. This field was 
used as the first stage stratification variable for nonpublic schools. Table 3-4 provides 
information on metro status stratification for the participating jurisdictions. 



3.4.5 School Type 

All nonpublic schools in the sampling frame were assigned a school type (Catholic or 
other nonpublic) based on their QED school type variable. This field was used as the second 
stage stratification variable for nonpublic schools. Table 3-4 provides information on school type 
stratification for the participating jurisdictions. 



3.4.6 Median Household Income 

Prior to the selection of the school samples, the schools were sorted by their primary and 
secondary stratification variables in a serpentine order. Within this sorted list, the schools were 
sorted, in serpentine order, by the median household income. This final stage of sorting 
resulted in implicit stratification of median income. The data on median household income 
were related to the ZIP code area in which the school is located. These data, derived from the 
1990 Census, were obtained from Donnelly Marketing Information Services. 
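
The serpentine ordering described above can be illustrated with a minimal Python sketch. It is not the production sorting routine: the field names `stratum` and `income` are hypothetical, and only the final level (median household income within each explicit stratum) is snaked here, whereas a full serpentine sort would also alternate direction through the stratification variables themselves.

```python
from itertools import groupby


def serpentine_by_income(schools):
    """Order schools by stratum, alternating the income direction per stratum.

    `schools` is a list of dicts with the hypothetical keys 'stratum' and
    'income'. Income is ascending in the first stratum, descending in the
    second, ascending in the third, and so on, so that neighbouring schools
    on the list have similar incomes even across stratum boundaries.
    """
    ordered = []
    by_stratum = sorted(schools, key=lambda s: s["stratum"])
    groups = groupby(by_stratum, key=lambda s: s["stratum"])
    for i, (_, group) in enumerate(groups):
        descending = (i % 2 == 1)  # reverse direction in every other stratum
        ordered.extend(sorted(group, key=lambda s: s["income"], reverse=descending))
    return ordered
```

A systematic sample drawn from a list ordered this way is spread across income levels within each stratum, which is what produces the implicit stratification.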



3.5 SCHOOL SAMPLE SELECTION FOR THE 1994 TRIAL STATE ASSESSMENT 

3.5.1 Control of Overlap of School Samples for National Educational Studies 

The issue of school sample overlap has been relevant in all rounds of NAEP in recent 
years. To avoid undue burden on individual schools, NAEP developed a policy for 1994 of 
avoiding overlap between national and state samples. This was to be achieved without unduly 
distorting the resulting samples by introducing bias or substantial variance. The procedure used 
was an extension of the method proposed by Keyfitz (1951). The general approach is given in 
the remainder of this section. Three fourth-grade schools, two in Delaware and one in the 
District of Columbia, were selected for both the national and state assessments. 

To control overlap between NAEP state and national samples, a procedure was used that 
conditions on the national NAEP PSU sample. This simply means that national school selection 
probabilities that were conditional on the selection of national sample PSUs (i.e., within PSU 
school selection probabilities) were used in determining state NAEP school selection 
probabilities. No adjustments were made to state NAEP school selection probabilities in 
jurisdictions where there were no national NAEP PSUs selected. This procedure reduces the 
variance of the state samples, although it leads to a greater degree of sample overlap than if 
unconditional national selection probabilities had been used in the procedure for controlling 






Table 3-4 

Distribution of the Selected Nonpublic Schools by Sampling Strata 

| Jurisdiction | Metro Status | School Type | Originally Selected Schools |
|---|---|---|---|
| Alabama | In metro area | Catholic | 2 |
| | In metro area | Other nonpublic | 7 |
| | Not in metro area | Other nonpublic | 2 |
| Arizona | In metro area | Catholic | 2 |
| | In metro area | Other nonpublic | 5 |
| | Not in metro area | Catholic | 1 |
| | Not in metro area | Other nonpublic | 3 |
| Arkansas | In metro area | Catholic | 2 |
| | In metro area | Other nonpublic | 4 |
| | Not in metro area | Catholic | 1 |
| | Not in metro area | Other nonpublic | 2 |
| California | In metro area | Catholic | 5 |
| | In metro area | Other nonpublic | 10 |
| Colorado | In metro area | Catholic | 4 |
| | In metro area | Other nonpublic | 5 |
| | Not in metro area | Other nonpublic | 2 |
| Connecticut | In metro area | Catholic | 11 |
| | In metro area | Other nonpublic | 5 |
| | Not in metro area | Other nonpublic | 1 |
| Delaware | In metro area | Catholic | 16 |
| | In metro area | Other nonpublic | 15 |
| | Not in metro area | Other nonpublic | 3 |
| District of Columbia | In metro area | Catholic | 14 |
| | In metro area | Other nonpublic | 12 |
| Florida | In metro area | Catholic | 4 |
| | In metro area | Other nonpublic | 11 |
| | Not in metro area | Other nonpublic | 1 |
| Georgia | In metro area | Catholic | 1 |
| | In metro area | Other nonpublic | 8 |
| | Not in metro area | Other nonpublic | 3 |







Table 3-4 (continued) 

Distribution of the Selected Nonpublic Schools by Sampling Strata 

| Jurisdiction | Metro Status | School Type | Originally Selected Schools |
|---|---|---|---|
| Guam | * | Catholic | 6 |
| | * | Other nonpublic | 5 |
| Hawaii | In metro area | Catholic | 7 |
| | In metro area | Other nonpublic | 12 |
| | Not in metro area | Catholic | 2 |
| | Not in metro area | Other nonpublic | 3 |
| Idaho | In metro area | Catholic | 1 |
| | In metro area | Other nonpublic | 1 |
| | Not in metro area | Catholic | 1 |
| | Not in metro area | Other nonpublic | 5 |
| Indiana | In metro area | Catholic | 8 |
| | In metro area | Other nonpublic | 7 |
| | Not in metro area | Catholic | 1 |
| | Not in metro area | Other nonpublic | 2 |
| Iowa | In metro area | Catholic | 6 |
| | In metro area | Other nonpublic | 1 |
| | Not in metro area | Catholic | 6 |
| | Not in metro area | Other nonpublic | 4 |
| Kentucky | In metro area | Catholic | 8 |
| | In metro area | Other nonpublic | 4 |
| | Not in metro area | Catholic | 1 |
| | Not in metro area | Other nonpublic | 1 |
| Louisiana | In metro area | Catholic | 11 |
| | In metro area | Other nonpublic | [illegible] |
| | Not in metro area | Catholic | [illegible] |
| | Not in metro area | Other nonpublic | [illegible] |
| Maine | In metro area | Catholic | 2 |
| | In metro area | Other nonpublic | 3 |
| | Not in metro area | Catholic | 2 |
| | Not in metro area | Other nonpublic | 5 |

* Metro status did not apply to Guam. 



Table 3-4 (continued) 

Distribution of the Selected Nonpublic Schools by Sampling Strata 

| Jurisdiction | Metro Status | School Type | Originally Selected Schools |
|---|---|---|---|
| Maryland | In metro area | Catholic | 9 |
| | In metro area | Other nonpublic | 10 |
| Massachusetts | In metro area | Catholic | 11 |
| | In metro area | Other nonpublic | 5 |
| | Not in metro area | Catholic | 1 |
| Michigan | In metro area | Catholic | 8 |
| | In metro area | Other nonpublic | 8 |
| | Not in metro area | Catholic | 2 |
| | Not in metro area | Other nonpublic | 2 |
| Minnesota | In metro area | Catholic | 9 |
| | In metro area | Other nonpublic | 5 |
| | Not in metro area | Catholic | 4 |
| | Not in metro area | Other nonpublic | 3 |
| Mississippi | In metro area | Catholic | 1 |
| | In metro area | Other nonpublic | 3 |
| | Not in metro area | Catholic | 1 |
| | Not in metro area | Other nonpublic | 7 |
| Missouri | In metro area | Catholic | 12 |
| | In metro area | Other nonpublic | 5 |
| | Not in metro area | Catholic | 2 |
| | Not in metro area | Other nonpublic | 2 |
| Montana | In metro area | Catholic | 1 |
| | In metro area | Other nonpublic | 2 |
| | Not in metro area | Catholic | 4 |
| | Not in metro area | Other nonpublic | 7 |
| Nebraska | In metro area | Catholic | 9 |
| | In metro area | Other nonpublic | 3 |
| | Not in metro area | Catholic | 6 |
| | Not in metro area | Other nonpublic | 6 |
| New Hampshire | In metro area | Catholic | 6 |
| | In metro area | Other nonpublic | 3 |
| | Not in metro area | Catholic | 1 |
| | Not in metro area | Other nonpublic | 3 |






Table 3-4 (continued) 

Distribution of the Selected Nonpublic Schools by Sampling Strata 

| Jurisdiction | Metro Status | School Type | Originally Selected Schools |
|---|---|---|---|
| New Jersey | In metro area | Catholic | [illegible] |
| | In metro area | Other nonpublic | [illegible] |
| New Mexico | In metro area | Catholic | [illegible] |
| | In metro area | Other nonpublic | 5 |
| | Not in metro area | Catholic | 1 |
| | Not in metro area | Other nonpublic | 5 |
| New York | In metro area | Catholic | [illegible] |
| | In metro area | Other nonpublic | 9 |
| | Not in metro area | Catholic | 1 |
| North Carolina | In metro area | Catholic | 1 |
| | In metro area | Other nonpublic | [illegible] |
| | Not in metro area | Other nonpublic | 2 |
| North Dakota | In metro area | Catholic | [illegible] |
| | In metro area | Other nonpublic | 1 |
| | Not in metro area | Catholic | 6 |
| | Not in metro area | Other nonpublic | [illegible] |
| Pennsylvania | In metro area | Catholic | 19 |
| | In metro area | Other nonpublic | 9 |
| | Not in metro area | Catholic | 2 |
| | Not in metro area | Other nonpublic | 1 |
| Rhode Island | In metro area | Catholic | [illegible] |
| | In metro area | Other nonpublic | [illegible] |
| | Not in metro area | Catholic | 1 |
| | Not in metro area | Other nonpublic | 1 |
| South Carolina | In metro area | Catholic | 1 |
| | In metro area | Other nonpublic | [illegible] |
| | Not in metro area | Other nonpublic | [illegible] |
| Tennessee | In metro area | Catholic | 2 |
| | In metro area | Other nonpublic | 2 |
| | Not in metro area | Other nonpublic | 2 |



Table 3-4 (continued) 

Distribution of the Selected Nonpublic Schools by Sampling Strata 

| Jurisdiction | Metro Status | School Type | Originally Selected Schools |
|---|---|---|---|
| Texas | In metro area | Catholic | 3 |
| | In metro area | Other nonpublic | 4 |
| | Not in metro area | Other nonpublic | 1 |
| Utah | In metro area | Catholic | 2 |
| | In metro area | Other nonpublic | 4 |
| | Not in metro area | Other nonpublic | 1 |
| Virginia | In metro area | Catholic | 3 |
| | In metro area | Other nonpublic | 6 |
| | Not in metro area | Other nonpublic | 2 |
| Washington | In metro area | Catholic | 3 |
| | In metro area | Other nonpublic | 8 |
| | Not in metro area | Catholic | 1 |
| | Not in metro area | Other nonpublic | 2 |
| West Virginia | In metro area | Catholic | 4 |
| | In metro area | Other nonpublic | 3 |
| | Not in metro area | Catholic | 1 |
| | Not in metro area | Other nonpublic | 3 |
| Wisconsin | In metro area | Catholic | 14 |
| | In metro area | Other nonpublic | 10 |
| | Not in metro area | Catholic | 6 |
| | Not in metro area | Other nonpublic | 6 |
| Wyoming | In metro area | Catholic | 1 |
| | In metro area | Other nonpublic | 2 |
| | Not in metro area | Catholic | 1 |
| | Not in metro area | Other nonpublic | 4 |







overlap between state and national samples. The procedure also recognizes the impact of the 
heavy within-PSU sampling in noncertainty PSUs in some jurisdictions, even though the 
unconditional probabilities of selection for such schools in the national samples were quite low. 
The procedure worked as follows: 

Let $N = 1$ if the school is selected in the national sample; let $N = 0$ otherwise. Let 
$P_N = P(N = 1)$. Thus, $P_N = 0$ for schools not located within a selected national sample PSU. 
Let $\pi_S$ denote the probability that a school is to be selected for the state fourth-grade sample. 
Schools to be included with certainty in the state sample ($\pi_S = 1$) are not subject to overlap 
control, as such schools are self-representing in the state sample. Excluding such schools on a 
random basis would add undue variance to the state estimates. For actually drawing the state 
samples, a conditional probability of selection, $\pi_S^*$, was derived as follows for each school in the 
frame having fourth-grade enrollment: 

$$
\pi_S^* =
\begin{cases}
1 & \text{if } \pi_S = 1 \\[4pt]
\max\left(0,\ \dfrac{\pi_S + P_N - 1}{P_N}\right) & \text{if } \pi_S < 1,\ P_N > 0, \text{ and } N = 1 \\[8pt]
\min\left(1,\ \dfrac{\pi_S}{1 - P_N}\right) & \text{if } \pi_S < 1,\ P_N > 0, \text{ and } N = 0 \\[8pt]
\min(1,\ \pi_S) & \text{if } \pi_S < 1,\ P_N = 0
\end{cases}
$$

The values of $\pi_S^*$ are conditional on the selection of PSUs for the national NAEP 
samples. 

This procedure in general gave state NAEP conditional selection probabilities that are 
smaller than the unconditional selection probabilities for schools selected for the national 
sample. If $P_N$ and $\pi_S$ are relatively small, then $\pi_S + P_N - 1 < 0$, so that there was no chance of 
selecting the school for the state sample if it was in the national sample. The probability that a 
school was selected in the state sample, conditional on the national PSU sample but 
unconditional on the national school sample selection within PSUs, is equal to $\pi_S$, as desired. 
This follows from the above formulation of $\pi_S^*$ and the fact that $P(N = 1) = P_N$. The quantity 
$\pi_S$ is used as the basis for weighting the schools, and hence the students, in the state samples. 

To illustrate the implementation of these expressions for drawing the state sample, 
consider the following example. Suppose that $\pi_S = 0.3$ and $P_N = 0.25$. Then $\pi_S^* = 0.4$ if the 
school is not selected for the national sample. Thus in this case the school is selected with 
probability 0.4. If the school is selected for the national sample, $\pi_S^* = 0$. Thus there is no 
chance that this school will be selected for both the national and state samples. Integrating over 
the national sampling process gives the required unconditional state selection probability of 
0.3 ($= 0.4 \times (1 - 0.25) + 0 \times 0.25$). 
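
As a check on the worked example, the conditional probabilities can be computed with a short Python sketch. It assumes the conditional probability is max(0, (π_S + P_N − 1)/P_N) for a school in the national sample and min(1, π_S/(1 − P_N)) otherwise, which is the form consistent with the example values of 0.4 and 0; the function and argument names are illustrative and do not come from the report.

```python
def conditional_state_prob(pi_s, p_n, in_national):
    """Conditional state selection probability under overlap control.

    pi_s        -- unconditional state selection probability
    p_n         -- within-PSU national selection probability, P(N = 1)
    in_national -- True if the school was drawn for the national sample
    """
    if pi_s >= 1.0:
        return 1.0                      # certainty schools are exempt
    if p_n == 0.0:
        return min(1.0, pi_s)           # no national PSU: nothing to adjust
    if in_national:
        return max(0.0, (pi_s + p_n - 1.0) / p_n)
    return min(1.0, pi_s / (1.0 - p_n))


def unconditional_state_prob(pi_s, p_n):
    """Average the conditional probability over the national draw."""
    total = 0.0
    if p_n > 0.0:
        total += p_n * conditional_state_prob(pi_s, p_n, True)
    if p_n < 1.0:
        total += (1.0 - p_n) * conditional_state_prob(pi_s, p_n, False)
    return total


# Values from the example in the text.
print(conditional_state_prob(0.3, 0.25, False))
print(conditional_state_prob(0.3, 0.25, True))
print(unconditional_state_prob(0.3, 0.25))
```

Averaging over the national draw returns the original π_S whenever π_S < 1, which is why π_S rather than the conditional value can serve as the basis for weighting.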



Selection of Schools in Small Jurisdictions 

For jurisdictions with small numbers of public schools — specifically, Delaware, the 
District of Columbia, and Guam — all of the eligible fourth-grade public schools were included in 
the sample with certainty. This did not occur in any of the nonpublic school samples. 



New School Selection 

A district-level file was constructed from the fourth-grade school frame. The file was 
divided into a small districts file, consisting of those districts in which there were at most three 
schools on the aggregate frame and no more than one fourth-, one eighth-, and one twelfth- 
grade school. The remainder of districts were denoted as "large" districts. 

A sample of large districts was drawn in each jurisdiction. All districts were selected in 
Delaware, the District of Columbia, Hawaii, and Rhode Island. In the remaining jurisdictions, 
the file of large districts (eligible for sampling) was divided into two files within each 
jurisdiction. Two districts were selected per jurisdiction with equal probability among the 
smaller districts with combined enrollment of less than or equal to 20 percent of the 
jurisdiction’s enrollment. From the rest of the file, eight districts were selected per jurisdiction 
with probability proportional to enrollment. The breakdown given above applied to all 
jurisdictions except Alaska and Nevada, where four and seven districts were selected with equal 
probability and six and three districts were selected with probability proportional to enrollment, 
respectively. The 10 selected districts in each jurisdiction were then sent a listing of all their 
schools that appeared on the QED sampling frame, and were asked to provide information 
about new schools not included in the QED frame. These listings, provided by selected districts, 
were used as sampling frames for selection of new schools. 

The eligibility of a school was determined based on its grade span. A school was 
classified as "new" if the change in grade span was such that the school's status changed from 
ineligible to eligible. The average grade enrollment for these schools was set to the average 
grade enrollment before the grade span change. The schools found eligible for sampling due to 
the grade span change were added to the frame. 

Each fourth-grade school was assigned the measure of size: 

$$
\text{measure of size} =
\begin{cases}
60 & \text{if enrollment} \le 70 \\
\text{enrollment} & \text{if enrollment} > 70
\end{cases}
$$

The probability of selecting a school was 

$$
\min\left[\frac{\text{sampling rate} \times \text{measure of size}}{P(\text{district})},\ 1\right]
$$

where P(district) was the probability of selection of a district and the sampling rate was the rate 
used for the particular jurisdiction in the selection of the original sample of schools. 
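
A minimal Python sketch of the new-school selection probability follows. It takes the measure of size as it reads in the text (60 for enrollment of 70 or less, otherwise the enrollment itself); the function names and the numeric inputs in the usage lines are illustrative only and are not values from the assessment.

```python
def measure_of_size(enrollment):
    """Measure of size for a fourth-grade school, as stated in the text."""
    return 60 if enrollment <= 70 else enrollment


def new_school_selection_prob(sampling_rate, enrollment, p_district):
    """Probability of selecting a new school, capped at 1.

    sampling_rate -- rate used for the jurisdiction's original school sample
    enrollment    -- fourth-grade enrollment of the new school
    p_district    -- probability with which the school's district was selected
    """
    return min(sampling_rate * measure_of_size(enrollment) / p_district, 1.0)


# Hypothetical inputs, chosen only to show both branches of the cap.
print(new_school_selection_prob(0.001, 120, 0.5))
print(new_school_selection_prob(0.01, 200, 0.5))
```

Dividing by P(district) compensates for the district stage, so that a new school's overall chance of selection is comparable to that of a school of the same size on the original frame.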



In each jurisdiction, the sampling rate used for the main sample of fourth-grade schools 
was used to select the new schools for "large" districts. Additionally, all new eligible schools 
coming from "small" districts (those with at most one fourth-grade and one eighth-grade school) 
that had a school selected in the regular sample for the fourth grade were included in the 
sample with certainty. 

Table 3-5 shows the number of new schools coming from the large and small districts for 
the fourth-grade samples. 



3.5.4 Designating Monitor Status 

One-fourth of the selected public schools were designated at random to be monitored 
during the assessment field period in all jurisdictions that had also participated in the 1992 Trial 
State Assessment. One-half of the selected public schools were designated to be monitored in 
jurisdictions that had not participated in the 1992 assessment, specifically Montana, 
Washington, and Department of Defense Education Activity Overseas. One-half of all 
nonpublic schools in every jurisdiction (regardless of 1992 participation) were designated to be 
monitored. The details of the implementation of the monitoring process in the field are given in 
Chapter 4. The purpose of monitoring a random quarter or half of the schools was to ensure 
that the procedures were being followed throughout each jurisdiction by the school and district 
personnel administering the assessments, and to provide data adequate for assessing whether 
there was a significant difference in assessment results between monitored and unmonitored 
schools within each jurisdiction. 

The following procedure was used to determine the sample of schools to be monitored. 
The initially selected schools were sorted in the order in which they were systematically selected. 
New schools from "large" districts were added to the sample at the end of the list in random 
order. The sorted schools were then paired, and one member of every other pair was assigned 
at random, with probability 0.5, to be monitored. One member of each pair was assigned to be 
monitored in jurisdictions requiring 50 percent monitoring of public schools as well as for all 
nonpublic school samples. If there was an odd number of schools, the last school was assigned 
monitor status as if it were part of a pair. 
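
The pairing procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the program actually used: the function name is hypothetical, "every other pair" is taken to mean the first, third, fifth, and so on, and an unpaired last school is treated as one member of a pair whose second member does not exist.

```python
import random


def assign_monitor_status(schools, half=False, rng=None):
    """Return the set of schools designated to be monitored.

    schools -- identifiers in selection order (new schools appended at the end)
    half    -- True for 50 percent monitoring (one member of every pair),
               False for 25 percent (one member of every other pair)
    rng     -- optional random.Random instance, for reproducibility
    """
    rng = rng or random.Random()
    monitored = set()
    for pair_index, start in enumerate(range(0, len(schools), 2)):
        if not half and pair_index % 2 == 1:
            continue                      # skip alternate pairs for 25 percent
        pick = start + rng.randrange(2)   # each member chosen with probability 0.5
        if pick < len(schools):           # an odd last school may or may not be picked
            monitored.add(schools[pick])
    return monitored
```

Because pairs are formed from schools adjacent in selection order, the monitored and unmonitored groups are balanced on the stratification variables used to sort the frame.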



3.5.5 School Substitution and Participation 

A substitute school was assigned to each sampled school (to the extent possible) prior to 
the field period through an automated substitute selection mechanism that used distance 
measures as the matching criterion. Two passes were made at the substitution, one 





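Table 3-5 

Distribution of New Schools Coming From Large and Small Districts in the Fourth-grade Sample 

| Jurisdiction | Number of New Schools: Large Districts | Number of New Schools: Small Districts |
|---|---|---|
| Alabama | 0 | 0 |
| Arizona | 0 | 0 |
| Arkansas | 1 | 1 |
| California | 0 | 0 |
| Colorado | 0 | 1 |
| Connecticut | 0 | 0 |
| Delaware | 1 | 0 |
| DoDEA Overseas | 0 | 0 |
| District of Columbia | 1 | 0 |
| Florida | 1 | 0 |
| Georgia | 1 | 1 |
| Guam | 0 | 0 |
| Hawaii | 1 | 0 |
| Idaho | 0 | 0 |
| Indiana | 1 | 0 |
| Iowa | 1 | 0 |
| Kentucky | 0 | 0 |
| Louisiana | 0 | 0 |
| Maine | 0 | 2 |
| Maryland | 1 | 0 |
| Massachusetts | 0 | 0 |
| Michigan | 0 | 0 |
| Minnesota | 0 | 0 |
| Mississippi | 0 | 0 |
| Missouri | 0 | 0 |
| Montana | 0 | 0 |
| Nebraska | 0 | 0 |
| New Hampshire | 0 | 0 |
| New Jersey | 1 | 0 |
| New Mexico | 0 | 0 |
| New York | 1 | 0 |
| North Carolina | 0 | 2 |
| North Dakota | 0 | 0 |
| Pennsylvania | 0 | 0 |
| Rhode Island | 3 | 0 |
| South Carolina | 1 | 0 |
| Tennessee | 0 | 0 |
| Texas | 1 | 0 |
| Utah | 0 | 0 |
| Virginia | 1 | 0 |
| Washington | 0 | 0 |
| West Virginia | 0 | 0 |
| Wisconsin | 0 | 0 |
| Wyoming | 0 | 0 |
| Total | 17 | 7 |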






assigning substitutes from outside the sampled school's district and a second pass lifting this 
constraint. This strategy was instigated by the fact that most school nonresponse is really at the 
district level. 

A distance measure was used in each pass and was calculated between each sampled 
school and each potential substitute. The distance measure was equal to the sum of four 
squared, standardized differences. The differences were calculated between the sampled and 
potential substitute school's estimated grade enrollment, median household income, percent 
Black enrollment and percent Hispanic enrollment. Each difference was squared and 
standardized to the population standard deviation of the component variable (e.g., estimated 
grade enrollment) across all fourth-grade schools and all jurisdictions. The potential substitutes 
were then assigned to sampled schools by order of increasing distance measure. An acceptance 
limit was put on the distance measure of 0.60. A given potential substitute was assigned to one 
and only one sampled school. Some sampled schools did not receive assigned substitutes (at 
least in the first pass) because the number of potential substitutes was less than the number of 
sampled schools or the distance measure for all remaining potential substitutes outside of the 
same district was greater than 0.60. 
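
Written out, the distance measure described above can be sketched as follows. The notation is an assumption introduced here for illustration: x_{sk} and x_{ck} denote the value of component variable k (estimated grade enrollment, median household income, percent Black enrollment, percent Hispanic enrollment) for sampled school s and potential substitute c, and \sigma_k is the population standard deviation of that variable. The report does not print the formula, so this is one reading of "squared, standardized differences":

```latex
D(s, c) \;=\; \sum_{k=1}^{4} \left( \frac{x_{sk} - x_{ck}}{\sigma_k} \right)^{2},
\qquad \text{accept } c \text{ for } s \text{ only if } D(s, c) \le 0.60 \text{ (first pass)}.
```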

In the second pass, the different district constraint was lifted and the maximum distance 
allowed was raised to 0.75. This generally brought in a small number of additional assigned 
substitutes. Although the selected cut-off points of 0.60 and 0.75 on the distance measure were 
somewhat arbitrary, they were decided upon by reviewing a large number of listings beforehand 
and finding a consensus on the distance measures at which substitutes began to appear 
unacceptable. 
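
The two-pass procedure can be illustrated with a short sketch. This is not the production program used for the assessment; the `School` record, the function names, and the greedy pairing rule (take candidate pairs in order of increasing distance, using each substitute at most once) are assumptions made for illustration. Only the four component variables, the 0.60 and 0.75 limits, and the different-district constraint in the first pass come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# The four component variables named in the text (field names are hypothetical).
FIELDS = ("enrollment", "income", "pct_black", "pct_hispanic")


@dataclass(frozen=True)
class School:
    school_id: str
    district: str
    enrollment: float
    income: float
    pct_black: float
    pct_hispanic: float


def distance(a: School, b: School, sd: Dict[str, float]) -> float:
    """Sum of four squared, standardized differences."""
    return sum(((getattr(a, f) - getattr(b, f)) / sd[f]) ** 2 for f in FIELDS)


def run_pass(sampled: List[School], pool: List[School], sd: Dict[str, float],
             limit: float, other_district_only: bool,
             assigned: Dict[str, str], used: Set[str]) -> None:
    """Assign substitutes in order of increasing distance, up to `limit`."""
    pairs = []
    for s in sampled:
        if s.school_id in assigned:
            continue  # already has a substitute from an earlier pass
        for c in pool:
            if c.school_id in used:
                continue  # each substitute serves one and only one school
            if other_district_only and c.district == s.district:
                continue  # first-pass constraint: substitute from another district
            d = distance(s, c, sd)
            if d <= limit:
                pairs.append((d, s.school_id, c.school_id))
    for _, sid, cid in sorted(pairs):
        if sid not in assigned and cid not in used:
            assigned[sid] = cid
            used.add(cid)


def assign_substitutes(sampled: List[School], pool: List[School],
                       sd: Dict[str, float]) -> Dict[str, str]:
    """Pass 1: other districts only, limit 0.60. Pass 2: any district, limit 0.75."""
    assigned: Dict[str, str] = {}
    used: Set[str] = set()
    run_pass(sampled, pool, sd, 0.60, True, assigned, used)
    run_pass(sampled, pool, sd, 0.75, False, assigned, used)
    return assigned
```

In this sketch `sd` holds the population standard deviation of each component variable, which the text says was computed across all fourth-grade schools and all jurisdictions. Sampled schools that find no candidate within the limits are simply absent from the returned mapping.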

Table 3-6 includes information about the number of substitutes provided in each 
jurisdiction. Of the 44 jurisdictions participating, 34 were provided with at least one substitute. 
Among jurisdictions receiving no substitutes, the majority had 100 percent participation from the 
original sample. In a few cases, however, refusals did occur after the November 1 deadline. 
The number of substitutes provided to a jurisdiction ranged from zero to 24 in the fourth-grade 
sample. A total of 243 substitutes were selected. Some jurisdictions did not attempt to solicit 
participation from the substitute schools provided, as they considered the timing too late to seek 
cooperation from schools not previously notified about the assessment. 

Tables 3-7 and 3-8 show the number of schools in the fourth-grade reading samples, 
together with school response rates observed within participating jurisdictions. The tables also 
show the number of substitutes in each jurisdiction that were associated with a nonparticipating 
original school selection, and the number of those that participated. The numbers of 
participating schools differ slightly from those given in Chapter 4. The numbers of participating 
schools in Tables 3-7 and 3-8 indicate the numbers of schools from which useable assessment 
data were received. In a few instances, assessments were conducted but the data were never 
received. 



3.6 STUDENT SAMPLE SELECTION 

Schools initially sent a complete list of students to a central location in November 1993. 
Schools were not asked to list students in any particular order, but were asked to implement 



Table 3-6

Substitute School Counts for Fourth-grade Schools

Jurisdiction              Substitutes

Alabama                        8
Arizona                        0
Arkansas                       9
California                    12
Colorado                       1
Connecticut                    2
Delaware                       0
DoDEA Overseas                 0
District of Columbia           0
Florida                        3
Georgia                        1
Guam                           0
Hawaii                         2
Idaho                         24
Indiana                       10
Iowa                          15
Kentucky                      10
Louisiana                      2
Maine                          4
Maryland                       3
Massachusetts                  1
Michigan                      17
Minnesota                     12
Mississippi                    4
Missouri                       2
Montana                        6
Nebraska                       8
New Hampshire                  9
New Jersey                     7
New Mexico                     0
New York                      22
North Carolina                 0
North Dakota                  18
Pennsylvania                   4
Rhode Island                   6
South Carolina                 4
Tennessee
Texas                          3
Utah                           0
Virginia                       1
Washington                     0
West Virginia                  1
Wisconsin                      7
Wyoming                        0

Total                        243







Table 3-7 

Distribution of the Fourth-grade Public-school Sample by Jurisdiction 





Weighted Percent School 
Participation 


Number of Schools in the Original 
Sample 


Number of Substitute 
Schools for 

Nonparticipating Originals 


Number of 
Partici- 
pating 
Schools 


Jurisdiction 


Before 

Substitution 


After 

Substitution 


Total 


Not 

Eligible 


Participated 


Provided 


Participated 


Alabama 


86.78 


93 J9 


107 


1 


92 


14 


7 


99 


Arizona 


99.04 


99.04 


107 


2 


104 


1 


0 


104 


Arkansas 


86.20 


94.09 


109 


6 


89 


9 


8 


97 


California 


80.09 


90.52 


106 


0 


85 


21 


11 


96 


Colorado 


100.00 


100.00 


108 


0 


108 


0 


0 


108 


Connecticut 


96.47 


96.47 


105 


1 


100 


3 


0 


100 


Delaware 


100.00 


100.00 


54 


3 


51 


0 


0 


51 


DoDEA Overseas 


99.25 


99.25 


83 


1 


81 


1 


0 


81 


Dist. of Columbia 


100.00 


100.00 


117 


10 


107 


0 


0 


107 


Florida 


100.00 


100.00 


107 


0 


107 


0 


0 


107 


Georgia 


99.05 


99.05 


107 


2 


104 


1 


0 


104 


Guam 


100.00 


100.00 


21 


0 


21 


0 


0 


21 


Hawaii 


99.07 


99.07 


106 


1 


104 


0 


0 


104 


Idaho 


69.23 


91.45 


109 


1 


74 


2: 


24 


98 


Indiana 


83.07 


92.48 


107 


0 


89 


18 


10 


99 


Iowa 


85.29 


99.05 


110 


2 


92 


16 


15 


107 


Kentucky 


88.48 


96.16 


107 


2 


93 


11 


8 


101 


Louisiana 


100.00 


100.00 


105 


2 


103 


0 


0 


103 


Maine 


94.41 


96.99 


116 


9 


101 


4 


3 


104 


Maryland 


94.23 


96.15 


106 


2 


98 


6 


2 


100 


Massachusetts 


97.02 


97.02 


105 


3 


99 


3 


0 


99 


Michigan 


63.19 


79.77 


106 


3 


65 


38 


17 


82 


Minnesota 


85.67 


95.22 


107 


2 


90 


14 


10 


100 


Mississippi 


95.20 


99.04 


105 


1 


99 


4 


4 


103 


Missouri 


96.48 


98.40 


109 


2 


103 


4 


2 


105 


Montana 


85.10 


88.58 


135 


6 


105 


23 


6 


111 


Nebraska 


70.59 


77 J5 


144 


2 


101 


38 


8 


109 


New Hampshire 


71.17 


7. J 


109 


0 


77 


25 


9 


86 


New Jersey 


85.15 


91 


107 


2 


89 


15 


7 


96 


New Mexico 


100.00 


100.00 


108 


3 


105 


0 


0 


105 


New York 


74.54 


90.57 


106 


0 


79 


26 


17 


96 


North Carolina 


99.05 


99.05 


108 


2 


105 


1 


0 


105 


North Dakota 


79.62 


91.19 


129 


1 


101 




16 


117 


Pennsylvania 


79.85 


83.69 


107 


2 


84 


20 


4 


88 


Rhode Island 


80.22 


85.54 


109 


2 


86 


14 


6 


92 


South Carolina 


95.26 


97.15 


106 


1 


100 


4 


2 


102 


Tennessee 


71.85 


73.79 


106 


3 


74 


27 


2 


76 


Texas 


91.26 


93.20 


108 


4 


95 


9 


2 


97 


Utah 


100.00 


100.00 


106 


1 


105 


0 


0 


105 


Virginia 


98.10 


99.05 


107 


2 


103 


2 


1 


104 


Washington 


100.00 


100.00 


106 


2 


104 


0 


0 


104 


West Virginia 


99.07 


100.00 


112 


1 


110 


1 


1 


111 


Wisconsin 


79.15 


85.56 


108 


3 


83 


21 


7 


90 


Wyoming 


98.38 


98.38 


121 


5 


112 


4 


0 


112 


Total 






4673 


98 


4077 


447 


209 


4286 
















Table 3-8 

Distribution of the Fourth-grade Nonpublic-school Sample by Jurisdiction 



Jurisdiction 


Weighted Percent School 
Participation 


Number of Schools in the Original 
Sample 


Number of Substitute 
Schools for 

Nonparticipating Originals 


Number of 
Partici- 
pating 
Schools 


Before 

Substitution 


After 

Substitution 


Total 


Not 

Eligible 


Participated 


Provided 


Participated 


Alabama 


91.67 


95.83 


11 


0 


8 


3 


1 


9 


Arizona 


34.93 


34.93 


11 


0 


3 


8 


0 


3 


Arkansas 


80.83 


93.78 


9 


0 


6 


3 


1 


7 


California 


41.96 


51.42 


15 


4 


5 


6 


1 


6 


Colorado 


71J6 


85.00 


11 


1 


7 


3 


1 


8 


Connecticut 


72.70 


81.89 


17 


1 


11 


4 


2 


13 


Delaware 


72.66 


72.66 


34 


3 


22 


5 


0 


22 


Dist. of Columbia 


41.74 


41.74 


26 


0 


12 


3 


0 


12 


Florida 


52.21 


73.28 


16 


X 


8 


7 


3 


11 


Georgia 


74.03 


83.77 


12 


1 


8 


3 


1 


9 


Guam 


96.02 


96.02 


11 


0 


9 


0 


0 


9 


Hawaii 


80.40 


87.54 


24 


2 


17 


5 


2 


19 


Idaho 


89.29 


89.29 


8 


0 


7 


1 


0 


7 


Indiana 


85.09 


85.09 


18 


4 


10 


4 


0 


10 


Iowa 


100.00 


100.00 


17 


1 


16 


0 


0 


16 


Kentucky 


69.71 


85.12 


14 


0 


10 


4 


2 


12 


Louisiana 


81.91 


91.30 


21 


0 


17 


4 


2 


19 


Maine 


79.45 


100.00 


12 


4 


7 


1 


1 


8 


Maryland 


62.74 


69.81 


19 


2 


10 


7 


1 


11 


Massachusetts 


94.52 


100.00 


17 


2 


14 


1 


1 


15 


Michigan 


00.00 


00.00 


20 


3 


0 


17 


0 


0 


Minnesota 


90.63 


98.78 


21 


0 


18 


3 


2 


20 


Mississippi 


64.40 


64.40 


12 


1 


7 


4 


0 


7 


Missouri 


90.35 


90.35 


21 


0 


19 


2 


0 


19 


Montana 


64.78 


64.78 


14 


2 


7 


5 


0 


7 


Nebraska 


48.05 


48.05 


24 


0 


11 


11 


0 


11 


New Hampshire 


53.85 


53.85 


13 


2 


5 


6 


0 


5 


New Jersey 


76.19 


76.19 


23 


1 


17 


5 


0 


17 


New Mexico 


100.00 


100.00 


14 


5 


9 


0 


0 


9 


New York 


40.34 


61.52 


25 


0 


10 


15 


5 


15 


North Carolina 


32.26 


32.26 


9 


2 


2 


5 


0 


2 


North Dakota 


76.60 


90.88 


17 


2 


12 


2 


2 


14 


Pennsylvania 


72.42 


72.42 


31 


5 


17 


9 


0 


17 


Rhode Island 


92.98 


92.98 


20 


1 


17 


2 


0 


17 


South Carolina 


69.12 


85.71 


12 


3 


5 


4 


2 


7 


Tennessee 


41.06 


41.06 


11 


1 


4 


6 


0 


4 


Texas 


24.24 


39.39 


8 


1 


2 


5 


1 


3 


Utah 


22.88 


22.88 


7 


1 


1 


5 


0 


1 


Virginia 


80.75 


80.75 


11 


1 


8 


1 


0 


8 


Washington 


00.00 


00.00 


14 


0 


0 


14 


0 


0 


West Virginia 


86.13 


86.13 


11 


2 


7 


2 


0 


7 


Wisconsin 


65.71 


65.71 


36 


4 


20 


12 


0 


20 


Wyoming 


00.00 


00.00 


8 


0 


0 


8 


0 


0 


Total 






705 


63 


405 


215 


31 


436 






checks to ensure that all fourth-grade students were listed. Based on the total number of 
students on this list, called the Student Listing Form, sample line numbers were generated for 
student sample selection. To generate these line numbers, the sampler entered the number of 
students on the form and the number of sessions into a calculator that had been programmed 
with the sampling algorithm. The calculator generated a random start that was used to 
systematically select the student line numbers (30 per session). Delaware was the only 
jurisdiction for which more than one session was conducted in a school. Up to three sessions 
were conducted in Delaware public schools, with the exact number of sessions being determined 
by the fourth-grade enrollment of each school. To compensate for new enrollees not on the 
Student Listing Form, extra line numbers were generated for a supplemental sample of new 
students. 
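
The report does not reproduce the algorithm that was programmed into the sampler's calculator. As a hedged illustration only, systematic selection with a random start might look like the sketch below; the function name, the fractional-interval method, and the rule of taking every listed student when the list is shorter than the target are assumptions, while the figure of 30 line numbers per session comes from the text.

```python
import math
import random
from typing import List, Optional


def select_line_numbers(n_listed: int, n_sessions: int = 1,
                        per_session: int = 30,
                        rng: Optional[random.Random] = None) -> List[int]:
    """Systematically select line numbers from a Student Listing Form.

    n_listed is the number of students on the form; the target sample
    size is per_session * n_sessions.
    """
    rng = rng or random.Random()
    target = per_session * n_sessions
    if n_listed <= target:
        # Assumption: short lists are taken in full.
        return list(range(1, n_listed + 1))
    interval = n_listed / target          # fractional sampling interval (> 1)
    start = rng.random() * interval       # random start within the first interval
    lines = []
    for i in range(target):
        line = math.floor(start + i * interval) + 1   # 1-based line number
        lines.append(min(line, n_listed))             # guard against rounding
    return lines
```

For example, `select_line_numbers(95)` returns 30 line numbers spread evenly over a 95-student list, and `select_line_numbers(200, n_sessions=3)` returns 90, as would be needed for a school holding three sessions.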

After the student sample was selected, the administrator at each school identified 
students who were incapable of taking the assessment either because they had an Individualized 
Education Plan or because they were Limited English Proficient. More details on the 
procedures for student exclusion are presented in the report on field procedures for the Trial 
State Assessment Program. 

When the assessment was conducted in a school, a count was made of the number of 
nonexcluded students who did not attend the session. If this number exceeded three students, 
the school was instructed to conduct a make-up session to which were invited all students who 
had been absent from the initial session. 

Tables 3-9 and 3-10 provide the distribution of the student samples and response rates by 
jurisdiction. 



Table 3-9 

Distribution of the Fourth-grade Public-school Student Sample and Response Rates by Jurisdiction 



Jurisdiction 


Weighted Student 
Response Rate 
(Percent) 


Number of Students 


In Original 
Sample 


Excluded 
from Sample 


To Be 
Assessed 


Actually 

Assessed 

Alabama 


96.07 


2911 


162 


2749 


2646 


Arizona 


94.27 


3010 


204 


2806 


2651 


Arkansas 


95.96 


2808 


169 


2639 


2535 


California 


93.86 


2801 


404 


2397 


2252 


Colorado 


94.25 


3120 


221 


2899 


2730 


Connecticut 


95.6^ 


2944 


248 


2696 


2578 


Delaware 


95.58 


2496 


153 


2343 


2239 


DoDEA Overseas 


94.58 


2666 


117 


2549 


2413 


District of Columbia 


94.52 


3072 


271 


2801 


2646 


Florida 


93.96 


3167 


326 


2841 


2666 


Georgia 


95.45 


3072 


171 


2901 


2766 


Guam 


95.91 


2517 


220 


2297 


2203 


Hawaii 


95.45 


3020 


154 


2866 


2732 


Idaho 


96.12 


2847 


145 


2702 


2598 


Indiana 


95.86 


2919 


153 


2766 


2655 


Iowa 


95.54 


3024 


140 


2884 


2759 


Kentucky 


96.68 


2963 


114 


2849 


2758 


Louisiana 


96.08 


3007 


181 


2826 


2713 


Maine 


94.32 


2854 


275 


2579 


2436 


Maryland 


95.20 


2897 


216 


2681 


2555 


Massachusetts 


95.43 


2874 


236 


2638 


2517 


Michigan 


94.87 


2397 


135 


2262 


2142 


Minnesota 


95.49 


2915 


133 


2782 


2655 


Mississippi 


96.68 


3022 


169 


2853 


2762 


Missouri 


95.04 


2972 


156 


2816 


2670 


Montana 


95.72 


2711 


93 


2618 


2501 


Nebraska 


94.85 


2634 


114 


2520 


2395 


New Hampshire 


95.58 


2441 


145 


2296 


2197 


New Jersey 


95.30 


2799 


162 


2637 


2509 


New Mexico 


94.68 


3022 


241 


2781 


2635 


New York 


95.34 


2847 


227 


2620 


2495 


North Carolina 


95.83 


3127 


173 


2954 


2833 


North Dakota 


96.63 


2690 


59 


2631 


2544 


Pennsylvania 


94.13 


2569 


143 


2426 


2290 


Rhode Island 


94.70 


2614 


140 


2474 


1342 


South Carolina 


96.39 


2999 


190 


2809 


2707 


Tennessee 


95.63 


2217 


127 


2090 


1998 


Texas 


96.45 


2862 


317 


2545 


2454 


Utah 


94.82 


3034 


153 


2881 


2733 


Virginia 


94.65 


3089 


216 


2873 


2719 


Washington 


94.45 


3054 


158 


2896 


2737 


West Virginia 


95.88 


3087 


213 


2874 


2757 


Wisconsin 


96.34 


2609 


189 


2420 


2331 


Wyoming 


95.92 


2943 


130 


2813 


2699 


Total 




125643 


8063 


117580 


112153 











Table 3-10 

Distribution of the Fourth-grade Nonpublic-school Student Sample and Response Rates by Jurisdiction 



Jurisdiction 


Weighted Student 
Response Rate 
(Percent) 


Number of Students 


In Original 
Sample 


Excluded 
from Sample 


To Be 
Assessed 


Actually 

Assessed 


Alabama 


95.00 


212 


4 


208 


199 


Arkansas 


94.67 


164 


1 


163 


154 


California 


97.11 


153 


0 


153 


149 


Colorado 


93.93 


139 


0 


139 


130 


Connecticut 


95.11 


310 


5 


305 


290 


Delaware 


97.53 


558 


0 


558 


544 


District of Columbia 


96.67 


281 


4 


277 


267 


Florida 


98.12 


273 


1 


272 


267 


Georgia 


96.59 


225 


0 


225 


217 


Guam 


97.90 


380 


0 


380 


372 


Hawaii 


96.23 


430 


2 


428 


415 


Idaho 


96.02 


98 


0 


98 


94 


Indiana 


95.03 


231 


2 


229 


219 


Iowa 


98.83 


334 


3 


331 


327 


Kentucky 


96.61 


287 


0 


287 


278 


Louisiana 


96.81 


474 


2 


472 


457 


Maine 


94.63 


90 


0 


90 


85 


Maryland 


96.96 


286 


3 


283 


275 


Massachusetts 


96.15 


321 


7 


314 


302 


Minnesota 


96.06 


415 


8 


407 


390 


Mississippi 


95.70 


169 


6 


163 


156 


Missouri 


95 J8 


392 


1 


391 


372 


Montana 


94.49 


157 


0 


157 


148 


Nebraska 


97.17 


218 


0 


218 


211 


New Jersey 


05.86 


399 


3 


396 


370 


New Mexico 


92.30 


229 


22 


207 


191 


New York 


96.35 


389 


7 


382 


369 


North Dakota 


93.22 


277 


7 


270 


253 


Pennsylvania 


94.43 


456 


2 


454 


427 


Rhode Island 


96.17 


369 


1 


368 


354 


South Carolina 


98.07 


160 


0 


160 


156 


Virginia 


95.95 


159 


1 


158 


151 


West Virginia 


97.17 


135 


1 


134 


130 


Wisconsin 


95.01 


407 


1 


406 


388 


Total 




9577 


94 


9483 


9116 







Chapter 4 

STATE AND SCHOOL COOPERATION AND FIELD ADMINISTRATION 



Nancy Caldwell and Mark M. Waksberg 
Westat, Inc. 



4.1 OVERVIEW 

By volunteering to participate in the Trial State Assessment and in the field test that 
preceded it, each jurisdiction assumed responsibility for securing the cooperation of the schools 
sampled by NAEP. The participating jurisdictions were responsible for the actual administration 
of the 1994 Trial State Assessment at the school level. The 1993 field test, however, operated 
within the framework of the national (rather than Trial State) model. Therefore, for the field 
test, NAEP field staff were responsible for securing cooperation, scheduling, and conducting the 
assessments. This chapter describes state and school cooperation and field administration 
procedures for both the 1993 field test and the 1994 program. Section 4.2 presents information 
on the field test, while section 4.3 focuses on the 1994 Trial State Assessment. 



4.2 THE FIELD TEST 
4.2.1 Conduct of the Field Test 

In preparation for the 1994 state and national assessment programs, a field test of the 
forms, procedures, and booklet items was held in 1993. In the 1993 field test, assessments were 
piloted in five subject areas: reading, mathematics, science, U.S. history, and world geography. 
The field test design focused on instructionally relevant approaches to assessment such as 
performance-based science tasks, the use of calculators, protractors, rulers and other 
manipulatives in mathematics, and the introduction of a world atlas and a retail catalog as 
resource tools in the geography and reading assessments. 

In August 1993, letters were sent from the U.S. Department of Education to all Chief 
State School Officers inviting them to participate in the field test of materials and procedures 
for 1994. In an effort to secure the participation of more schools and to lessen the burden of 
participation on jurisdictions, ETS and Westat offered to perform all of the work involved, 
including sampling, communicating with school staff, and administering the assessment. 

The school sample for the field test included both public and nonpublic schools and was 
designed to involve as many jurisdictions as possible, thus limiting the burden on each 
jurisdiction. However, small jurisdictions in which all of the schools had been involved in the 




1992 NAEP were excluded from the 1993 field test sample, which had the effect of eliminating 
the small jurisdictions entirely. As a result, the field test sample was spread very roughly in 
proportion to the population across 38 jurisdictions. Each participating jurisdiction was asked to 
appoint a state coordinator to serve as the liaison between NAEP/Westat staff and the 
participating schools. State coordinators were asked only to notify districts of their inclusion 
and to support the schools’ participation in the field test. 

The original school sample comprised 905 schools. For each originally sampled school, 
up to three substitutes or "alternate" schools were named by Westat. The three levels of 
alternate schools included specified substitutes within the same district that were 
demographically comparable to the originally selected schools, an option that allowed district 
superintendents to choose their own alternate schools. In the event that a district was not able 
to participate, an "out-of-district" alternate school was offered. The type and number of sessions 
scheduled for an originally selected school remained constant across alternates. 

From October to December 1992, all districts and schools in the 1993 field test sample 
were contacted, cooperation obtained, and assessment schedules set. To accomplish this, 11 of 
the most experienced NAEP supervisors were each responsible for gaining cooperation in 
districts and schools in several jurisdictions. In January 1993, the NAEP field staff expanded to 
51 supervisors. Each supervisor, including those in the original group, was responsible for 
sampling and conducting assessments in a single region of approximately 20 schools. 



4.2.2 Results of the Field Test 

A total of 844 originally selected schools and alternates actually participated in the field 
test. The final assessed sample of schools included 300 schools at grade 4, 273 schools at grade 
8, and 272 schools at grade 12. 

A total of 46,849 students participated in the field test: 13,962 students at grade 4, 
17,439 students at grade 8, and 15,448 students at grade 12. The overall student participation 
rate was 86.8 percent: 93 percent at grade 4, 89.4 percent at grade 8, and 79.3 percent at grade 
12. A total of 811 students (1.7%) who were sampled for the assessment were excluded from 
participation by their schools. 

Depending on the size of the school, a school’s sample numbered approximately 30 to 60 
students, who were assessed in either one or two sessions. The desired number of student 
responses to the assessment items being tested was achieved. 



4.3 THE 1994 TRIAL STATE ASSESSMENT 

Forty-one states, the District of Columbia, and Guam volunteered for the 1994 Trial 
State Assessment, as did the Department of Defense Education Activity (DoDEA) overseas 
schools. Figure 4-1 identifies the jurisdictions participating in the last two assessment 







years (similar information is presented in table form in Chapter 1). As was the case for the 
1992 Trial State Assessment, each jurisdiction designated its own coordinator to oversee all 
assessment activities in its jurisdiction. 



4.3.1 Overview of Responsibilities 

Data collection for the 1994 Trial State Assessment involved a collaborative effort 
between the participating jurisdictions and the NAEP contractors, especially Westat, the field 
administration contractor. Westat’s responsibilities included: 

• selecting the sample of schools and students for each participating jurisdiction; 

• developing the administration procedures and manuals; 

• training state personnel to conduct the assessments; and 

• conducting an extensive quality assurance program. 

Each jurisdiction volunteering to participate in the 1994 program was asked to appoint a 
state coordinator. In general, the coordinator was the liaison between NAEP/Westat staff and 
the participating schools. In particular, the state coordinator was asked to: 

• gain the cooperation of the selected schools; 

• assist in the development of the assessment schedule; 

• receive the lists of all grade-eligible students from the schools; 

• coordinate the flow of information between the schools and NAEP; 

• provide space for the state supervisor to use when selecting the sample of 
students; 

• notify assessment administrators about training and send them their manuals; and 

• send the lists of sampled students to the schools. 

At the school level, an assessment administrator was responsible for preparing for and 
conducting the assessment session(s) in one or more schools. These individuals were usually 
school or district staff and were trained by Westat staff. The assessment administrator’s 
responsibilities included: 

• receiving the list of sampled students from the state coordinator; 

• identifying sampled students who should be excluded; 

• distributing assessment questionnaires to appropriate school staff and collecting 
them upon their completion; 

• notifying sampled students and their teachers; 

• administering the assessment session(s); 

• completing assessment forms; and 

• preparing the assessment materials for shipment. 

Westat hired and trained six field managers and 46 state supervisors, one for each 
jurisdiction (two supervisors were hired for DoDEA overseas schools, one working in Europe 
and the other in the Far East). Each field manager was responsible for working with the state 



coordinators of seven to eight jurisdictions and for overseeing assessment activities. The 
primary tasks of the field managers were to: 



• obtain information about cooperation and scheduling; 

• make sure the arrangements for the assessments were set and assessment 
administrators identified; and 

• schedule the assessment administrators' training sessions. 



The primary tasks of the state supervisors were to: 

• select the sample of students to be assessed; 

• recruit and hire the quality control monitors throughout their jurisdiction; 

• conduct in-person assessment administrator training sessions; and 

• coordinate the monitoring of the assessment sessions and make-up sessions. 

Westat also hired and trained an average of four quality control monitors in each 
jurisdiction to monitor the assessment sessions. 



4.3.2 Schedule of Data Collection Activities 

August 1993 

Westat sent the samples of schools selected for the national and Trial 
State Assessment to the state coordinators. At the time of this mailing, a 
final decision had not been made as to which grades would be included in 
the Trial State Assessment, so the lists included fourth-, eighth-, and 
twelfth-grade schools. Some state coordinators chose to inform all 
selected districts and schools immediately, while others waited until the 
final authorization was received from Congress and then informed only 
the schools with fourth grade. 

October 1993 

Westat field managers visited each jurisdiction to explain the 
computerized state coordinator system, which could be used to keep track 
of assessment-related activities. 

Westat distributed Student Listing Forms, Principal Questionnaires, and 
the list of the schools selected for the Trial State Assessment, updated 
with a suggested week of assessment and number of sessions. 

September-November 1993 

State coordinators obtained cooperation from districts and schools and 
reported participation status to Westat field managers via printed lists or 
computer files. 

State coordinators sent Student Listing Forms, Supplemental Student 
Listing Forms, and Principal Questionnaires to participating schools. 

November 3-6, 1993 

State supervisors were trained. 

November 15, 1993 

Suggested cutoff for decisions on participating schools and submission of 
list of grade-eligible students to state coordinators for sampling purposes. 

November 29-December 10, 1993 

NAEP state supervisors visited state coordinators to select student 
samples and prepare Administration Schedules listing the students 
selected for each session. 

December 7, 1993-January 7, 1994 

Westat provided the schedule of training sessions and copies of the 
Manual for Assessment Administrators to state coordinators for 
distribution. 

State coordinators notified assessment administrators of the date and time 
of training and sent each a copy of the Manual for Assessment 
Administrators. 

January 6-8, 1994 

Quality control monitors were trained. 

January 10-28, 1994 

Assessment administrators were trained. 

January 31-February 25, 1994 

Assessments were conducted. Unannounced visits were made by quality 
control monitors to a predetermined subset of the sessions. 

February 23-March 4, 1994 

Make-up sessions were held as necessary. 


4.3.3 Preparations for the Trial State Assessment 

The focal point of the schedule for the Trial State Assessment was the period between 
January 31 and February 25, 1994 when the assessments were conducted in the schools. 
However, as with any undertaking of this magnitude, the project required many months of 
planning and preparation. 

Westat selected the samples of schools according to the procedures described in Chapter 
3. On August 18, 1993, lists of the selected schools and other materials describing the Trial 
State Assessment Program were sent to state coordinators. Most state coordinators also 
preferred that NAEP provide a suggested assessment date for each school. School listings were 
updated with this information and were sent to the state coordinators, along with other 
descriptive materials and forms, in October. 

State coordinators were also given the option of receiving the school information in the 
form of a computer database with accompanying management information software. This 
system enabled state coordinators to keep track of the cooperating schools, the assessment 
schedule, the training schedule, and the assessment administrators. Coordinators could choose 
to receive a laptop computer and printer or to have the system installed on their own computer. 




Westat field managers traveled to the state offices to explain the computer system to the state 
coordinators and their staff. All but two state coordinators chose to use the computerized 
system. 



Six of the most experienced NAEP supervisors were chosen to be field managers, the 
primary link between NAEP and the state coordinators. In October, the field managers visited 
offices of the state coordinators to explain the computer system to state staff. The field 
managers kept in frequent contact with the state coordinators as the state coordinators secured 
the cooperation of the selected schools and established the assessment schedule. 

The field managers used the same computer system as the state coordinators to keep 
track of the schools and the schedule. The state coordinators sent updates via computer disks, 
telephone, or print to their field manager, who then entered the information into the system. 
Weekly transmissions were made from the field manager to Westat. 

By the first of November, Westat hired one state supervisor for each participating 
jurisdiction. The state supervisors attended a training session held November 3-6, 1993. This 
training session focused on the state supervisors’ immediate tasks — selecting the student 
samples and hiring quality control monitors. Supervisors were given the training script and 
materials for the assessment administrators’ training sessions they would conduct in January so 
they could become familiar with these materials. 

The state supervisors’ first task after training was to complete the selection of the sample 
of students who were to be assessed in each school. All participating schools were asked to send 
a list of their grade-eligible students to the state coordinator by November 15. Sample selection 
activities were conducted in the state coordinator's office unless the state coordinator preferred 
that the lists be taken to another location. 

Using a preprogrammed calculator, the supervisors generally selected a sample of 30 
students per session type per school. The exceptions to this were in small schools and 
jurisdictions with fewer than the necessary 100 fourth-grade public schools. In the jurisdictions 
with fewer schools, larger student samples were required from schools that participated. In the 
1994 Trial State Assessment, this was only necessary in the schools in Delaware and DoDEA 
overseas. 

After the sample was selected, the supervisor completed an Administration Schedule for 
each session, listing the students to be assessed. The Administration Schedules for each school 
were put into an envelope and given to the state coordinator to send to the school two weeks 
before the scheduled assessment date. Included in the envelope were instructions for sampling 
students who had enrolled at the schools since the creation of the original list. 

During the period from mid-November through December, the state supervisors also 
recruited and hired quality control monitors to work in their jurisdictions. It was the quality 
control monitor’s job to observe the sessions designated to be monitored, to complete an 
observation form on each session, and to intervene when the correct procedures were not 
followed. Since studies have shown no measurable difference between the performance in 
monitored and unmonitored sessions, the proportion of schools monitored was lowered to reduce the 
costs of the field work. In any jurisdiction in which the fourth grade had previously participated 
in the Trial State Assessment, the percentage of public schools to be monitored was reduced to 
25 percent. Because nonpublic schools were included for the first time, their monitor rate was 
50 percent. Also, any jurisdiction that had not previously participated at the fourth-grade level 
was monitored at 50 percent as well. The identity of the schools to be monitored was known only to 
contractor staff; it did not appear on any of the listings provided to state staff. 
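
The monitoring rule described above can be written as a small lookup. This is only an illustrative sketch: the function name and argument names are hypothetical, and the two rates (25 and 50 percent) are the only values taken from the text.

```python
def monitoring_rate(school_type: str, grade4_participated_before: bool) -> float:
    """Return the fraction of schools to be monitored in a jurisdiction.

    school_type: "public" or "nonpublic" (labels assumed for this sketch).
    grade4_participated_before: True if the jurisdiction took part in an
    earlier Trial State Assessment at the fourth-grade level.
    """
    if school_type not in ("public", "nonpublic"):
        raise ValueError(f"unknown school type: {school_type!r}")
    # Nonpublic schools were new in 1994, so they were monitored at 50 percent.
    if school_type == "nonpublic":
        return 0.50
    # Public schools: 25 percent for returning jurisdictions, 50 percent otherwise.
    return 0.25 if grade4_participated_before else 0.50


if __name__ == "__main__":
    print(monitoring_rate("public", True))
    print(monitoring_rate("public", False))
    print(monitoring_rate("nonpublic", True))
```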

Approximately 200 quality control monitors were trained in a session held January 6-8, 
1994. The first day of the training session was devoted to a presentation of the assessment 
administrators’ training program by the state supervisors, which not only gave the monitors an 
understanding of what assessment administrators were expected to do, but gave state supervisors 
an opportunity to practice presenting the training program. The remaining days of the training 
session were spent reviewing the quality control monitor observation form and the role and 
responsibilities of the quality control monitors. 

Almost immediately following the quality control monitor training, supervisors began 
conducting training for assessment administrators. Each quality control monitor attended 
several of these training sessions, to assist the state supervisor and to become thoroughly 
familiar with the assessment administrator’s responsibilities. Almost 4,700 assessment 
administrators were trained in about 450 training sessions across the nation. 

To ensure uniformity in the training sessions, Westat developed a highly structured 
program involving a script for trainers, a videotape, and an example to be completed by the 
trainees. The supervisors were instructed to read the script verbatim as they proceeded through 
the training, ensuring that each trainee received the same information. The script was 
supplemented by the use of overhead transparencies, displaying the various forms that were to 
be used and enabling the trainer to demonstrate how they were to be filled out. 

The videotape, similar to the one used in the 1992 Trial State Assessment, was 
developed by Westat to provide background for the study and to simulate the various steps of 
the assessment that would be repeated by the administrators. The portions of the videotape 
depicting an assessment had been taped in a classroom with students in attendance to closely 
simulate an actual assessment session. The videotape was divided into sections with breaks for 
review by the trainer and practice for the trainees. 

The final component of the presentation was a training example consisting of a set of 
exercises keyed to each part of the training package. A portion of the videotape was shown and 
then reviewed by the trainer. Related exercises were then completed by the trainees before the 
next subject was discussed. 

The entire training session generally ran for about three and one-half hours. Sessions 
usually began in the morning and ended with lunch. This reduction in time (from about five 
hours in 1990) for the training session, initiated in 1992, was appreciated by the trainees. 

All of the information presented in the training session was included in the Manual for 
Assessment Administrators, developed by Westat. Copies of the manual were sent by Westat to 
the state coordinators at the beginning of December so that they could be distributed to the 
assessment administrators before the training sessions. 






4.3.4 Monitoring of Assessment Activities 



Two weeks prior to the scheduled assessment date, the assessment administrator 
received the Administration Schedule and assessment questionnaires and materials. Five days 
before the assessment, the quality control monitor made a call to the administrator and 
recorded the results of the call on the Observation Form. Most of the questions asked in the 
pre-assessment call were designed to gauge whether the assessment administrator had received 
all materials needed and had completed the preparations for the assessment. The 40-page 
Quality Control Monitor Observation Form is included in the Report on Data Collection 
Activities for the 1994 National Assessment of Educational Progress (Westat, Inc., 1995). 

Pre-assessment calls were made to all schools regardless of whether they were to be 
monitored. If the sessions in a school were not observed, the quality control monitor called the 
assessment administrator three days after the assessment to find out how the session went, to 
obtain the assessment administrator’s impressions of the manual, training, and materials and to 
ensure that all post-assessment activities had been completed. 

If the sessions in a school were to be monitored, the quality control monitor was to 
arrive at the school one hour before the scheduled beginning of the assessment to observe 
preparations for the assessment. To ensure the confidentiality of the assessment items, the 
booklets were packaged in shrink-wrapped bundles and were not to be opened until the quality 
control monitor arrived or 45 minutes before the session began, whichever occurred first. 
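
The "whichever occurred first" rule for opening the bundles amounts to taking the earlier of two times. The sketch below is illustrative only; the function and parameter names are hypothetical, and an unmonitored session is assumed to fall back to the 45-minute rule alone.

```python
from datetime import datetime, timedelta
from typing import Optional


def earliest_bundle_opening(session_start: datetime,
                            monitor_arrival: Optional[datetime] = None) -> datetime:
    """Earliest time the shrink-wrapped bundles may be opened."""
    cutoff = session_start - timedelta(minutes=45)
    if monitor_arrival is None:
        # No monitor present: only the 45-minute rule applies (assumption).
        return cutoff
    # Whichever occurs first: the monitor's arrival or 45 minutes before start.
    return min(cutoff, monitor_arrival)


if __name__ == "__main__":
    start = datetime(1994, 2, 1, 9, 0)
    print(earliest_bundle_opening(start, datetime(1994, 2, 1, 8, 0)))
    print(earliest_bundle_opening(start))
```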

In addition to observing the opening of the bundles, the quality control monitor used the 
Observation Form to check that the following had been done correctly: sampling newly enrolled 
students, reading the script, distributing and collecting assessment materials, timing the booklet 
sections, answering questions from students, and preparing assessment materials for shipment. 

After the assessment was over, the quality control monitor obtained the assessment 
administrator’s opinions of how the session went and how well the materials and forms worked. 

If four or more students were absent from the session, a makeup session was to be held. 
If the original session had been monitored, the makeup session was also monitored. This 
required coordination of scheduling between the quality control monitor and assessment 
administrator. 



4.3.5 School and Student Participation 

Table 4-1 shows the results of the state coordinators' efforts to gain the cooperation of 
the selected schools. Overall, 4,295 public schools and 437 nonpublic schools participated in the 
1994 Trial State Assessment. This is about 86 percent (unweighted) of the eligible schools in 
the original sample at each grade and about 91 percent (unweighted) of the sample after 
substitution. 



Table 4-1 

School Participation, 1994 Trial State Assessment 

Status                                                            Public  Nonpublic   Total
Schools in original sample                                          4671        705    5376
Schools not eligible (e.g., closed, no grade 4)                       98         63     161
Eligible schools in original sample                                 4573        642    5215
Noncooperating (e.g., school, district, or jurisdiction refusal)     489        236     725
Participating                                                       4084        406    4490
Substitutes provided for noncooperating schools                      441        214     655
Participating substitutes                                            211         31     242
Total schools participating after substitution                      4295        437    4732
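
The unweighted school participation rates quoted before Table 4-1 can be reproduced from the table counts. The sketch below is a worked check, not part of the original analysis; the variable and function names are hypothetical, and the counts are copied from Table 4-1.

```python
# Counts copied from Table 4-1 (public and nonpublic columns).
ELIGIBLE = {"public": 4573, "nonpublic": 642}
PARTICIPATING = {"public": 4084, "nonpublic": 406}
PARTICIPATING_SUBSTITUTES = {"public": 211, "nonpublic": 31}


def unweighted_rates(eligible: int, participating: int, substitutes: int):
    """Return (rate before substitution, rate after substitution) in percent."""
    before = 100.0 * participating / eligible
    after = 100.0 * (participating + substitutes) / eligible
    return round(before, 1), round(after, 1)


def table_rates():
    """Rates for each school type and for the combined total."""
    rates = {
        kind: unweighted_rates(ELIGIBLE[kind], PARTICIPATING[kind],
                               PARTICIPATING_SUBSTITUTES[kind])
        for kind in ("public", "nonpublic")
    }
    rates["total"] = unweighted_rates(sum(ELIGIBLE.values()),
                                      sum(PARTICIPATING.values()),
                                      sum(PARTICIPATING_SUBSTITUTES.values()))
    return rates


if __name__ == "__main__":
    for kind, (before, after) in table_rates().items():
        print(f"{kind:10s} before: {before:5.1f}%  after: {after:5.1f}%")
```

For the combined total this gives about 86.1 percent before substitution and 90.7 percent after, which agrees with the "about 86 percent" and "about 91 percent" stated in the text.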



Participation results for students in the 1994 Trial State Assessment in reading are given 
in Table 4-2. Approximately 140,000 students were sampled. As can be seen from the table, the 
original sample, which was selected by the NAEP state supervisors, comprised about 136,000 (or 
97%) of this number. The original sample size was increased somewhat after the supplemental 
samples had been drawn (from students newly enrolled since the creation of the original lists). 



Table 4-2 

Student Participation, 1994 Trial State Assessment 

Status                  Public   Nonpublic    Total
Sampled                 130452       10176   140628
  Original sample       126596       10013   136609
  Supplemental sample     3856         163     4019
Withdrawn                 4805         127     4932
Excluded                  8068         121     8189
To be assessed          117579        9928   127507
Assessed                112150        9544   121694
  Initial sessions      111187        9503   120690
  Make-up sessions         963          41     1004






Assessment administrators removed some students from the total sample according to 
NAEP criteria: first, those students who had left their schools since the time that they were 
sampled (withdrawn); then, those judged incapable of participating meaningfully in the 
assessment by school staff (excluded). Any student who had an Individualized Education Plan 
(IEP) for reasons other than being gifted and talented or who was classified as Limited English 
Proficient (LEP) could be considered for exclusion. To be excluded, an IEP student had to be 
"mainstreamed less than 50 percent of the time in academic subjects and/or judged incapable of 
participating meaningfully in the assessment." For an LEP student to be excluded, he or she 
had to be "a native speaker of a language other than English, and enrolled in an English-speaking 
school (not including a bilingual education program) for less than two years and judged 
incapable of taking part in the assessment." 



These exclusions left 127,507 fourth graders to be assessed in reading. Of these, 121,694 
were actually assessed, yielding an unweighted student participation rate of 95.4 percent. 
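
The arithmetic behind these figures can be checked against Table 4-2: students to be assessed are those sampled minus those withdrawn and excluded, and the participation rate is the share of that group actually assessed. The sketch below uses the Table 4-2 counts; the names are hypothetical.

```python
# Counts copied from Table 4-2.
TABLE_4_2 = {
    "public": {"sampled": 130452, "withdrawn": 4805, "excluded": 8068, "assessed": 112150},
    "nonpublic": {"sampled": 10176, "withdrawn": 127, "excluded": 121, "assessed": 9544},
}


def to_be_assessed(sampled: int, withdrawn: int, excluded: int) -> int:
    """Students remaining after withdrawals and exclusions."""
    return sampled - withdrawn - excluded


def participation_rate(assessed: int, eligible: int) -> float:
    """Unweighted participation rate in percent, rounded to one decimal."""
    return round(100.0 * assessed / eligible, 1)


def total_counts():
    """Sum the public and nonpublic columns."""
    keys = ("sampled", "withdrawn", "excluded", "assessed")
    return {k: sum(col[k] for col in TABLE_4_2.values()) for k in keys}


if __name__ == "__main__":
    t = total_counts()
    eligible = to_be_assessed(t["sampled"], t["withdrawn"], t["excluded"])
    print(eligible, participation_rate(t["assessed"], eligible))
```

With the table's totals this yields 127,507 students to be assessed and a rate of 95.4 percent.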



4.3.6 Results of the Observations 

During the assessment sessions, the quality control monitors noted instances when the 
assessment administrators deviated from the prescribed procedures and whether any of these 
deviations were serious enough to warrant their intervention. Quality control monitors reported 
no serious breaches of the procedures or major problems that would call into 
question the validity of the assessment. 

Deviation from prescribed procedures occurred most often in the administrator’s reading 
of the script that introduced the assessment and provided the directions. Even so, in at least 92 
percent of the observed sessions in the public and nonpublic schools, the assessment 
administrator read the script verbatim or with only slight deviations. Examples of major 
deviations included skipping sections of the script, adding substantially to the script, and 
forgetting to pass out materials at the appropriate times. The quality control monitor 
intervened in these instances. 

Most of the other procedures that could have had some bearing on the validity of the 
results were adhered to very well by the assessment administrators. In 99 percent of the 
observed public-school sessions and 98 percent of the observed nonpublic school sessions, the 
assessment administrators opened the bundles of booklets at the appropriate time and handled 
questions from the students correctly. Ninety-eight percent of the public-school sessions and 100 
percent of the nonpublic-school sessions were timed correctly. 

After the assessment session was over, assessment administrators were asked how they 
thought the assessment went and whether they had any comments or suggestions. Overall, 
assessment administrators stated that they thought 99 percent of the sessions went either very 
well or satisfactorily. This figure was consistent across the public and nonpublic schools, as well 
as for both monitored and unmonitored sessions. The percentage of assessment administrators 
who thought their session had gone "very well" was about three percentage points higher in the 
monitored sessions than in the unmonitored. 





Comments about the assessment materials and procedures were generally favorable. 
Criticisms or suggestions included that there were too many forms and too much paperwork; 
coding the booklet covers was tedious and problematic for students; and schools needed more 
information about NAEP and assessment results. 

In addition to these interviews, Westat sent a debriefing form to all of the NAEP state 
supervisors and met in person with half of them. This meeting produced suggestions for future 
assessments, especially many minor changes in the procedures, materials, and training plans. In 
addition, the state supervisors recommended that district and particularly school staff receive 
more information describing the background and objectives of NAEP and the Trial State 
Assessments. They also stated that many school staff were very interested in results for their 
students, or at least summary results for their jurisdiction. 

State coordinators were also sent a questionnaire about their experiences, suggestions, 
and comments, to which 39 coordinators responded. Generally, the state coordinators felt that 
the assessments went more smoothly than in the past. They also commented favorably on the 
training package and other materials. Like the assessment administrators, the state coordinators 
criticized the amount of work required to prepare for the assessments. They made many other 
suggestions about the computerized data system, sampling procedures, training program, and 
design of the assessment. All of these suggestions will be reviewed as future assessments are 
planned. 

The results of the assessment and comments from assessment administrators and state 
coordinators were summarized in a report presented to the NAEP Network in October 1994. 
At that time, each participating jurisdiction received a summary of its participation data, data 
collection activities, results of the assessment, and assessment administrators’ comments. 






Chapter 5 

PROCESSING AND SCORING ASSESSMENT MATERIALS 



Patrick B. Bourgeacq, Charles L. Brungardt, Patricia M. Garcia Stearns, Tillie Kennel, 
Linda L. Reynolds, Timothy Robinson, Brent W. Studer, and Bradley J. Thayer 

National Computer Systems 



5.1 OVERVIEW 

This portion of the report reviews the activities conducted by National Computer 
Systems (NCS) for the 1994 NAEP Trial State Assessment. The 1994 assessment was an 
exciting one for NAEP and NCS because of the introduction of image scoring to the assessment. 
The advent of image scoring eliminated almost all paper handling during scoring and improved 
the monitoring and reliability of scoring. A short-term trend study was added to the assessment to 
compare the scoring of paper and the scoring of images of student responses from both 1992 and 
1994. 



In the early 1990s, NCS developed and implemented flexible, innovatively designed 
processing programs and a sophisticated process control system that allowed the integration of 
data entry and work flow management systems. The planning, preparation, and quality-conscious 
application of these systems in 1992 and 1994 have made the NAEP project an exercise 
in coordinated teamwork and excellence. 

This chapter begins with a description of the various tasks performed by NCS, detailing 
printing, distribution, receipt control, scoring, and processing activities. It also discusses specific 
activities involved in processing the assessment materials, and presents an analysis of several of 
those activities. The chapter provides documentation for the professional scoring 
effort — scoring guides, training papers, papers illustrating sample score points, calibration 
papers, calibration bridges, and interreader reliability reports. The detailed processing 
specifications and documentation of the NAEP process control system are presented later. 



5.1.1 Innovations for 1994 

Much of the information necessary for documentation of accurate sampling and for 
calculating sampling weights is collected on the administration schedules which, until 1993, were 
painstakingly filled out by hand by Westat administrative personnel. In 1994, for the first time, 
much of the work was computerized: booklets were preassigned and booklet ID numbers were 
preprinted on the administration schedule. When Westat personnel received the documents, 
they filled in only the "exception" information. This new method also permitted computerized 
updating of information when the administration schedules were received at NCS, eliminating 
the need to sort and track thousands of pieces of paper through the processing stream. 

The introduction of image processing and image scoring further enhanced the work of 
NAEP. Image processing and scoring were successfully piloted in a side-by-side study conducted 
during the 1993 NAEP field test, and so became the primary processing and scoring methods for 
the 1994 Trial State Assessment. Image processing allowed the automatic collection of 
handwritten demographic data from the administrative schedules and the student test booklet 
covers through intelligent character recognition (ICR). This service was a benefit to the 
jurisdictions participating in NAEP because they were able to write rather than grid certain 
information — a significant reduction of burden on the schools. Image processing also made 
image scoring possible, eliminating much of the time spent moving paper. The images of 
student responses to be scored were transmitted electronically to the scoring center, located at a 
separate facility from where the materials were processed. 

The success of this new method of transferring data has moved NAEP closer to 
achieving another goal — the simultaneous scoring of constructed-response items at multiple 
locations. This process enhanced the reliability and monitoring of scoring and allowed both 
NCS and ETS to focus attention on the intellectual process of scoring student responses. 

Tables 5-1 and 5-2 give an overview of the processing volume and the schedule for the 
1994 NAEP Trial State Assessment. 



Table 5-1 

1994 NAEP Trial State Assessment Processing Totals 

Document/Category                 Totals
Number of sessions                 4,842
Assessed student booklets        122,052
Absent student booklets            5,810
Excluded student booklets          8,189
IEP/LEP questionnaires            17,118
School questionnaires              4,690
Teacher questionnaires            17,231
Scanned documents                 62,058
Scanned sheets                 2,544,434
Key-entered documents                  0






Table 5-2 

1994 NAEP Trial State Assessment 
NCS Schedule 

                                                                    Planned     Planned     Actual      Actual
Activity                                                            Start Date  Finish Date Start Date  Finish Date
Subcontractor meeting                                               11/08/93    11/09/93    11/08/93    11/09/93
Network meeting to review items                                     06/18/93    06/18/93    06/18/93    06/18/93
Printing                                                            09/02/93    10/15/93    09/02/93    11/18/93
NCS submits receipt-control specifications plan                     10/01/93    10/01/93    10/01/93    10/01/93
All reading materials at NCS                                        10/15/93    10/15/93    11/18/93    11/18/93
Initial packaging begins                                            10/15/93    01/03/94    10/15/93    01/03/94
Weekly status reports on receipt control and procedures             10/18/93    05/31/94    10/18/93    06/10/94
State address file from Westat                                      11/17/93    11/17/93    11/17/93    11/17/93
95% of public schools to NCS from Westat                            11/19/93    11/19/93    11/19/93    11/19/93
Print nonpublic-school Administration Schedule                      11/24/93    11/24/93    11/24/93    11/24/93
Print public-school Administration Schedule                         11/24/93    11/24/93    11/24/93    11/24/93
NCS ships Administration Schedules to public-school supervisors     11/29/93    11/29/93    11/29/93    11/29/93
Ship public-school Administration Schedule                          11/29/93    11/29/93    11/29/93    11/29/93
Ship nonpublic-school Administration Schedule                       11/29/93    11/29/93    11/29/93    11/29/93
NCS ships Administration Schedules to nonpublic-school supervisors  12/13/93    12/13/93    12/13/93    12/13/93
Materials due in districts                                          01/02/94    01/02/94    01/02/94    01/02/94
Final packaging                                                     01/03/94    02/18/94    01/03/94    02/18/94
Distribution                                                        01/14/94    02/18/94    01/14/94    02/18/94
Public and nonpublic test administration                            01/31/94    03/04/94    01/31/94    04/11/94
Receiving                                                           02/01/94    03/11/94    02/01/94    03/18/94
Processing                                                          02/02/94    04/25/94    02/02/94    05/02/94
Scoring training preparation                                        02/21/94    03/11/94    02/21/94    03/11/94
Project through clean post                                          03/25/94    04/08/94    03/25/94    04/15/94
Constructed-response scoring/training                               03/28/94    05/13/94    03/28/94    05/27/94
Ship weights data tape to Westat                                    04/19/94    04/19/94    04/20/94    04/20/94
State questionnaires data tape delivered                            06/13/94    06/13/94    06/13/94    06/13/94
State reading data tape delivered                                   06/16/94    06/17/94    06/17/94    06/17/94






5.2 PRINTING 

5.2.1 Overview 

For the 1994 NAEP Trial State Assessment, 16 discrete documents were designed. More 
than 112,500 booklets and forms, totaling over 5.4 million pages, were printed. A list of these 
materials and key dates for their production is found in Table 5-3. 



The printing effort began in June 1993, with the design of the booklet covers and the 
administration schedule. This was a collaborative effort involving staff from ETS, Westat, and 
NCS. The covers were designed to facilitate the use of intelligent character recognition (ICR) 
to gather data. The administration schedule, which was designed to use both ICR and OMR 
(optical mark recognition), was the primary source of demographic data and also served as the 
session header for booklets when processed. Spaces for the same information were included on 
the student booklet cover as a backup source. For elements not individualized on the 
administration schedule (school number, Zip code, ILSQ number, and a "do not use" field), both 
handwritten information and OMR ovals were used on the booklet cover to assure complete, 
accurate data collection. 



5.2.2 Trial State Assessment Printing 

For the Trial State Assessment booklets, ETS provided one camera-ready copy of each 
unique cognitive block as well as of each set of directions and background sections. 

The printing effort for the Trial State Assessment materials began in June 1993, with the 
receipt of short-term trend reading blocks. The same camera-ready copy was used for these 
blocks as was used in the 1992 assessment; only the block designation on each page was 
changed. Camera-ready copy of the other reading blocks and all directions and background 
sections followed in August. 

Because a large number of documents had to be printed in a relatively short period of 
time, preparatory work was started before all parts of the test booklets were received. Upon 
receipt of camera-ready materials from ETS, NCS made duplicate copies of each unique block 
and booklet component. These were then checked for consistency in design. An attempt was 
also made to proofread text and check response foils. Any problems or questions were referred 
to ETS personnel. Whenever possible, corrections or changes were made by NCS; other times 
replacement copy was supplied by ETS. During this time, the number of pages for each 
assessment booklet was calculated to ensure that no booklets would exceed size limitations. 

As each block was received and as many issues as possible resolved, camera-ready 
materials were sent to the NCS forms division along with a guide indicating the number of times 
each cognitive block and booklet component would be repeated in the assessment battery. 
Preliminary work such as adding gridding ovals for response options began and the required 
numbers of negatives for each block and booklet component were made. Performance of these 
preliminary tasks was crucial to meeting the delivery schedule. 






The actual assembly of booklets began after all components for a particular booklet were 
received and the Office of Management and Budget had given its approval. Using mock-ups of 
booklets and "booklet maps" as guides, the NCS printer assembled prepared negatives into 
complete booklets. 

Rosters for teacher questionnaires, school questionnaires, and IEP/LEP student 
questionnaires were designed by NCS and reviewed by ETS. After approval, NCS produced 
camera-ready copy and mounted it on layout sheets for printing. 

School and teacher questionnaires were the last materials to be printed. NCS mounted 
camera-ready pages of the questionnaires received from ETS on NCS Mark Reflex layout 
sheets. In some cases spacing of text and answer foils had to be adjusted so that the gridding 
ovals would appear in scannable positions. Portions of questionnaire pages requiring redesign 
were revised by NCS to include shaded boxes to make use of ICR technology and were 
submitted to ETS for approval. 

The printer forwarded proofs for each unique booklet for review by NCS and ETS 
personnel. Clean-up work, where necessary, was indicated on the proofs. A content change in 
several blocks required multiple camera-ready copies that could be stripped into each affected 
booklet. ETS approved the proofs, and NCS reported this, along with any necessary changes, to 
the printer. Once approved, the booklets were printed in the colors agreed upon by NCS and 
ETS. Because reading booklets contained short-term trend items, the same colors were used as 
in the 1992 assessment. NCS and ETS personnel checked sample copies for color 
accuracy. Any booklets that did not meet color specifications were reprinted. 

As the booklets and forms were printed by vendors, pallets of documents were received 
and entered into NCS’s inventory control system. Sample booklets were selected and quality- 
checked for printing and collating errors. All printing for the 1994 NAEP Trial State 
Assessment was completed by November 30, 1993. 



5.3 PACKAGING AND SHIPPING 

5.3.1 Distribution 

The distribution effort for the 1994 NAEP Trial State Assessment involved packaging 
and mailing documents and associated forms and materials to individual schools. The NAEP 
materials distribution system, initially developed by NCS in 1990 to control shipments to the 
schools and supervisors, was enhanced and utilized. Files in this system contained the names 
and addresses for shipment of materials, scheduled assessment dates, and a listing of all 
materials available for use by a participant. Changes to any of this information were made 
directly in the distribution system file either manually by NCS staff or via file updates provided 
by Westat. The complex packaging effort, booklet accountability system, and on-line bundle 
assignment and distribution system are illustrated in Figure 5-1. 

[Figure 5-1. 1994 NAEP Trial State Assessment (figure not reproduced in this text)] 

Bar code technology, introduced by NCS in the 1990 assessment, continued to be utilized 
in document control. To identify each document, a unique ten-digit numbering system was 
devised, consisting of the three-digit booklet number or form type, a six-digit sequential number, 
and a check digit. Each form was assigned a range of ID numbers. Bar codes reflecting this ID 
number were applied to the front cover of each document through NCS bar code technology 
using an ink jet printer. After administration of the assessment, as bar codes were read during 
the scanning process, the document ID number was incorporated into each student record. 
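
The ten-digit layout (three-digit booklet or form type, six-digit sequence number, one check digit) can be sketched as follows. The report does not say how the check digit was computed, so this example assumes the widely used Luhn formula purely for illustration; the function names are hypothetical.

```python
def luhn_check_digit(payload: str) -> int:
    """Luhn check digit for a string of digits (assumed scheme, see lead-in)."""
    total = 0
    # Walk the payload right to left, doubling every other digit,
    # starting with the digit nearest the check-digit position.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def make_document_id(form_type: int, sequence: int) -> str:
    """Build a ten-digit ID: 3-digit form type + 6-digit sequence + check digit."""
    if not 0 <= form_type <= 999 or not 0 <= sequence <= 999999:
        raise ValueError("form_type or sequence out of range")
    payload = f"{form_type:03d}{sequence:06d}"
    return payload + str(luhn_check_digit(payload))


def is_valid_document_id(doc_id: str) -> bool:
    """True if the ID has ten digits and its check digit matches."""
    return (len(doc_id) == 10 and doc_id.isdigit()
            and luhn_check_digit(doc_id[:-1]) == int(doc_id[-1]))


if __name__ == "__main__":
    doc_id = make_document_id(12, 345)
    print(doc_id, is_valid_document_id(doc_id))
```

A check digit of this kind lets the scanner reject a misread bar code before the ID is written into a student record, which is the usual reason for including one.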

The booklets were then spiraled into bundles, according to the design specified by ETS. 
Bundles of 11 booklets were created in the pattern dictated by the bundle maps. The booklets 
were arranged in such a manner that each booklet appeared first in a bundle approximately the 
same number of times and the booklets were evenly distributed across the bundles. This 
assured that sample sizes of individual booklets would not be jeopardized if entire bundles were 
not used. Since all Administration Schedules for each scheduled session were preprinted with 
the booklet IDs designated for that session, only bundles of 11 booklets were created. Three 
bundles of booklets were preassigned to each session, giving each 33 booklets. This number 
most closely approximated the average projected session size of 30 students and allowed extra 
booklets either for additional students or to replace defective booklets. There were 16 unique 
spiral bundle types for the 1994 NAEP Trial State Assessment. 
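
One way to obtain the balance described above is to deal booklets into bundles by cycling continuously through the list of booklet types. The sketch below is an illustration under stated assumptions, not the ETS bundle maps: the count of 16 booklet types and the labels `R01` to `R16` are hypothetical, chosen only so that the example runs.

```python
from collections import Counter
from typing import List


def spiral_bundles(booklet_types: List[str], bundle_size: int,
                   n_bundles: int) -> List[List[str]]:
    """Deal booklet types into bundles by cycling through the list.

    Each bundle continues where the previous one stopped, so the
    starting booklet rotates from bundle to bundle.
    """
    n = len(booklet_types)
    bundles = []
    position = 0
    for _ in range(n_bundles):
        bundle = [booklet_types[(position + k) % n] for k in range(bundle_size)]
        bundles.append(bundle)
        position = (position + bundle_size) % n
    return bundles


if __name__ == "__main__":
    types = [f"R{i:02d}" for i in range(1, 17)]  # hypothetical booklet labels
    bundles = spiral_bundles(types, bundle_size=11, n_bundles=16)
    firsts = Counter(b[0] for b in bundles)
    overall = Counter(t for b in bundles for t in b)
    print(sorted(firsts.values()), sorted(overall.values()))
```

Under these assumptions, every booklet type leads exactly one bundle and appears 11 times across the 16 bundles, so losing an unused bundle affects all booklet types about equally.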

Each group of 11 booklets had a bundle slip/header sheet that indicated the subject 
area, bundle type, bundle number, and a list of the booklet types to be included in the bundle, 
along with a list of any other essential materials to be used with the session. All booklets had to 
be arranged in the exact order listed on the bundle header sheet. To ensure the security of the 
NAEP assessments, the following plan was used to account for all booklets: All bundles were 
taken to a bar code reader/document transport machine where they were scanned to interpret 
each bundle’s bar codes. The file of scanned bar codes was then transferred from the personal 
computer connected to the scanner to a mainframe data set. 

The unique bundle number on the header sheet informed the system program as to what 
type of bundle should follow. A computer job was run to compare the bundle type expected to 
the sequence of booklets that was actually scanned after the header. This job also verified that 
the appropriate number of booklets was included in each bundle. Any discrepancies were 
printed on an error listing and forwarded to the packaging department. The error was corrected 
and the bundle was again read into the system. This process was repeated until all bundles were 
correct. As a bundle cleared the process, it was flagged on the system as ready for distribution. 
All bundles were shrink-wrapped in clear plastic, bound with plastic strips, and labeled "Do not 
open until 45 minutes before assessment." The bundles were then ready for distribution. 
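
The verification job described above compares the booklets scanned after a header against the sequence expected for that bundle type. A minimal sketch follows, assuming the first three digits of each scanned document ID give the booklet number; the data structures and names are hypothetical and do not describe the actual NCS mainframe program.

```python
from typing import Dict, List


def verify_bundle(bundle_number: str,
                  scanned_ids: List[str],
                  bundle_type_of: Dict[str, str],
                  bundle_maps: Dict[str, List[str]],
                  expected_size: int = 11) -> List[str]:
    """Return a list of discrepancies for one scanned bundle (empty if clean)."""
    bundle_type = bundle_type_of.get(bundle_number)
    if bundle_type is None:
        return [f"bundle {bundle_number}: unknown bundle number"]
    expected = bundle_maps[bundle_type]
    # Assumed layout: the booklet number is the first three digits of the ID.
    scanned = [doc_id[:3] for doc_id in scanned_ids]
    errors = []
    if len(scanned) != expected_size:
        errors.append(f"bundle {bundle_number}: expected {expected_size} "
                      f"booklets, scanned {len(scanned)}")
    # Compare position by position against the bundle map.
    for pos, (want, got) in enumerate(zip(expected, scanned), start=1):
        if want != got:
            errors.append(f"bundle {bundle_number}: position {pos} "
                          f"expected booklet {want}, scanned {got}")
    return errors


if __name__ == "__main__":
    maps = {"A": [f"{i:03d}" for i in range(1, 12)]}
    ids = [booklet + "0000010" for booklet in maps["A"]]
    print(verify_bundle("B0001", ids, {"B0001": "A"}, maps))
```

An empty result corresponds to a bundle that clears the process; a nonempty list corresponds to the error listing sent back to the packaging department.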

Using sampling files provided by Westat, NCS assigned bundles to schools and 
customized the bundle slips and packing lists. File data from Westat was coupled with the file 
of bundle numbers and the corresponding booklet numbers. This file was then used to preprint 
all booklet identification numbers, school name, school number and session type, directly onto 
the scannable Administration Schedule. As a result, every session had specific bundles assigned 
to it in advance. This increased the quality of the booklet accountability system by enabling 
NCS to identify where any booklet should be at any time during the assessment. 

Distribution of materials was accomplished in waves according to the assessment date. 
Booklets were boxed by session, with the appropriate additional nonreusable materials included 
with each session. If the quantities of materials received were insufficient to conduct the 
assessment, additional materials could be requested by school supervisors via the NAEP toll-free 
line. 



Initially, a total of 5,182 sets of session materials were shipped for the 1994 NAEP Trial 
State Assessments. Approximately 143 additional shipments of booklets and miscellaneous 
materials were sent. All outbound shipments were recorded in the NCS outbound mail 
management system. A bar code containing the school number on each address label was read 
into the system, which determined the routing of the shipment and the charges. Information 
was recorded in a file on the system, which, at the end of the day, was transferred to the 
mainframe from a personal computer. A computer program could then access information to 
produce reports on shipments sent, regardless of the carrier used. 



5.3.2 Short Shipment and Phones 

A toll-free telephone line was maintained for administrators to request additional 
materials for the Trial State Assessments. To process a shipment, a clerk asked the caller for 
information such as primary sampling unit, school ID, assessment type, city, jurisdiction, and ZIP 
code. This information was then entered into the online short shipment system and a particular 
school and mailing address was displayed on the screen to be verified with the caller. The 
system allowed NCS staff to change the shipping address for individual requests. The clerk 
proceeded to the next screen, which displayed the materials to be selected. After the clerk 
entered the requested items, the due date, and the method of shipment, the system produced a 
packing list and mailing labels. Approximately 650 such calls were received regarding the Trial 
State Assessment. The number and types of calls are summarized in Table 5-4. 



Table 5-4 

1994 NAEP Trial State Assessment 
Phone Request Summary 

Number of Calls   Request 

      117         Excluded Student Questionnaires/IEP/LEP 
      318         Teacher Questionnaires 
       50         Additional bundles (some due to increasing sessions or replenishing 
                  supervisor’s supply) 
       11         School Characteristics and Policies Questionnaire 
       92         Additional miscellaneous materials (some missing in original shipment, 
                  some due to increasing sessions or sample) 
       34         Change in administration date, disposition, session information, 
                  tracing unreceived shipments, general questions 







5.4 PROCESSING 



5.4.1 Overview 

The following describes the stages of work involved in receiving and processing the 
documents used in the 1994 Trial State Assessment, as illustrated in Figure 5-2. NCS staff 
created a set of predetermined rules and specifications to be followed by the processing 
departments within NCS. Project staff performed a variety of procedures on materials received 
from the assessment supervisors before releasing these materials into the processing system. 
Control systems were used to monitor and route all NAEP materials returned from the field. 
The NAEP process control system contained the status of all sampled schools for all sessions 
and their scheduled assessment dates. As materials were returned, the process control system 
was updated to indicate receipt dates, to record counts of materials returned, and to document 
any problems discovered in the shipments. As documents were processed, the system was 
updated to reflect the processed counts. NCS report programs allowed ETS, Westat, and NCS 
staff to monitor the progress and the receipt control operations. The processing flow is 
illustrated in Figure 5-2. 

An "alerts" process was utilized to record, monitor, and categorize all discrepant or 
problematic situations. Throughout the processing cycle, alert situations were identified based 
upon the processing specifications. These situations were either flagged by computer programs 
or identified using clerical procedures. All situations that could not be directly resolved by the 
staff involved in the given process were documented. A form describing the problem was 
completed and the information was forwarded to project personnel for resolution. 

NCS’s work flow management system was used to track batches of student booklets, 
school questionnaires, teacher questionnaires, IEP/LEP student questionnaires, and rosters 
through each processing step, allowing project staff to monitor the status of all work in progress. 
The work flow management system was also used by NCS to analyze the current work load, by 
project, across all work stations. By routinely monitoring these data, NCS’s management staff 
was able to assign priorities to various components of the work and monitor all phases of the 
data receipt and processing. 

NCS used a team approach to facilitate the flow of materials through all data processing 
steps. The image processing team checked in the materials from the field, created the batches 
to be scanned, scanned the booklets, edited the information when the program found errors or 
inconsistencies, selected quality control samples, and sent the completed batches to the 
warehouse for storage. Advantages to the team environment included less duplication of effort 
and improved quality control measures. 



5.4.2 Document Receipt and Tracking 

All shipments were to be returned to NCS packaged in the original boxes. As mentioned 
earlier, NCS packaging staff applied a bar code label to each box that indicated the NAEP 
school ID number. When the shipment arrived at the NCS dock area, this bar code was 
scanned to a personal computer file and sorted by assessment type. The shipment was then 
forwarded to the receiving area. The personal computer file was then transferred to the 
mainframe and the shipment receipt date was applied to the appropriate school within the 
process control system. This provided the current status of receipts regardless of any processing 
delays. The receipt was reflected on the control system status report provided to the receiving 
department and was also supplied to Westat via electronic data file transfer. 

Figure 5-2 

1994 NAEP Trial State Assessment 
Materials Processing Flow Chart 

The process control system could be updated manually to reflect changes. Receiving 
personnel also checked the shipment to verify that the contents of the box matched the school 
and session indicated on the label. Each shipment was checked for completeness and accuracy. 

If it was discovered that a shipment had not been received within seven days of the scheduled 
assessment date, project staff were alerted. Project staff would then check the administration 
status of the session and, in some cases, initiate a trace on the shipment. 

If multiple sessions were returned in one box, the contents of the package were removed 
and separated by session. The shipment was checked to verify that all booklets preprinted or 
hand-written on the Administration Schedule were returned with the shipment and that all 
administration codes matched from the booklet covers to the Administration Schedule. If 
discrepancies were discovered at any step in this process, the receiving staff issued an alert and 
held the session for resolution by the NAEP project staff. 

If a make-up session had been scheduled, receiving staff issued an information alert to 
facilitate tracking, and the documents were placed on holding shelves until the make-up session 
documents arrived. 

Once all booklets listed on the Administration Schedule for sessions containing scannable 
documents were verified as being present, the entire set of session materials, including the 
Administration Schedule and booklets, was forwarded to the batching area and a batch was created 
on the work flow management system using the scannable Administration Schedule as a session 
header. The booklets were batched by grade level and assessment type. Each batch was 
assigned a unique batch number. The batch number, created on the image capture environment 
system and automatically uploaded to the work flow management system, facilitated the internal 
tracking of the batches and allowed departmental resource planning. All other scannable 
documents, questionnaires, and rosters were batched by document type in the same manner. 

The batched documents were then forwarded to the scanning area, where all information 
on the Administration Schedule and booklets was scanned via a W201 image scanner. All 
information from the Administration Schedules was read by the intelligent character recognition 
engine and verified by online editing staff. Information gathered throughout this process 
(the school number; session code; counts of the students in the original sample, 
supplemental sample, and total sample; numbers of students withdrawn, excluded, to be 
assessed, absent, assessed in original, and assessed in makeup; and total number of assessed 
students) was transferred electronically to Westat on a weekly basis to produce participation 
statistics. 

Two rosters were used to account for all questionnaires. The Roster of Questionnaires 
recorded the distribution and return of the school questionnaire and the IEP/LEP student 
questionnaire. The Roster of Teacher Questionnaires recorded teacher questionnaires 
distributed and returned for their respective students. Some questionnaires may not have been 
available for return with the shipment. These were returned to NCS at a later date in an 
envelope provided for that purpose. The questionnaires were submitted for scanning as 
sufficient quantities became available for batching. 

Receipt of the questionnaires was entered into the system using the same process used 
for the Administration Schedules. The rosters were grouped with other rosters of the same type 
from other sessions, and a batch was created on the image capture environment system. The 
batch was then forwarded to scanning where all information on the rosters was scanned into the 
system. 



A sophisticated booklet accountability system was used to track all booklets distributed. 
Prior to the distribution of materials, unique booklet numbers were read into a file by bundle. 
This file was then used to control distribution by assigning specific bundles to supervisors or 
schools. This assignment was recorded in the materials distribution system. 

When shipments were received, the used booklets were submitted to processing. Unused 
booklets were batched and their booklet ID bar codes were read into a file by the bar code 
scanner. This file and the processed documents file were later compared to the original bundle 
security file. A list of unmatched booklet IDs was printed in a report that was used to confirm 
nonreceipt of individual booklets. At the end of the assessment period, the supervisors returned 
all unused materials. When these materials were returned, the booklet IDs were read into a file 
by the bar code scanner. Any major discrepancies were directed to Westat for follow-up. The 
unused materials were then inventoried and sent to the NCS warehouse for storage. 
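
The accountability comparison described above amounts to a set difference between the booklets distributed and the booklets accounted for. The sketch below is illustrative only; the function name and the booklet ID values are hypothetical.

```python
def unmatched_booklets(bundle_security_ids, processed_ids, unused_ids):
    """Return booklet IDs that were distributed but neither processed nor returned unused."""
    accounted_for = set(processed_ids) | set(unused_ids)
    return sorted(set(bundle_security_ids) - accounted_for)


distributed = ["B001", "B002", "B003", "B004"]  # original bundle security file
processed = ["B001", "B003"]                    # processed documents file
unused = ["B004"]                               # unused booklets read by bar code scanner

print(unmatched_booklets(distributed, processed, unused))  # ['B002']
```

The resulting list corresponds to the report of unmatched booklet IDs used to confirm nonreceipt of individual booklets.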

The Receipt Control Status Report displayed the current status of all schools. This 
report could be sorted by school number or by scheduled administration date. As the receiving 
status of a school was updated through the receiving, opening, and batching processes, the data 
collected were added to this report. Data represented on this report included participation 
status, shipment receipt date, and receipt of the Roster of Questionnaires. The comment field 
in this report showed any school for which a shipment had not been received within seven days 
of the completion of the assessment administration. 



5.4.3 Data Entry 

The transcription of the student response data into machine-readable form was achieved 
through the use of three separate systems: 1) data entry (which included optical mark 
recognition scanning, image scanning, intelligent character recognition, and key entry); 2) 
validation (edit); and 3) resolution. 

The data entry process was the first point at which booklet level data were directly 
available to the computer system. Depending on the NAEP document, one of two methods was 
used to transcribe NAEP data to a computerized form. The data on scannable documents were 
collected using NCS optical scanning equipment, which also captured images of the 
constructed-response items. Nonscannable materials were keyed through an interactive online system. In 
both of these cases, the data were edited and suspect cases were resolved before further 
processing. 






All student booklets, questionnaires, and control documents were scannable documents. 
Throughout all phases of processing, the student booklets were batched by grade and session 
type. The scannable documents were then transported to a slitting area where the folded and 
stapled spine was removed from the document. This process utilized an "intelligent slitter" to 
prevent slitting the wrong side of the document. The documents were jogged by machine so that 
the registration edges of the NAEP documents were smoothly aligned, and the stacks were then 
returned to the cart to be scanned. The bar code identification numbers used to maintain 
process control were decoded and transcribed to the NAEP computerized data file. 

During the scanning process (shown in Figure 5-3), each scannable NAEP document was 
uniquely identified using a print-after-scan number consisting of the scan batch number and the 
sequential number within the batch. The number was assigned to and printed on one side of 
each sheet of each document as it exited the scanner. This permitted the data editors to quickly 
and accurately locate specific documents during the editing phase. The print-after-scan number 
remained with the data record and provided a method for easy identification and quick retrieval 
of any document. 

The data values were captured from the booklet covers and Administration Schedules 
and were coded as numeric data. Unmarked fields were coded as blanks and processing staff 
were alerted to missing or uncoded critical data. Fields that had multiple marks were coded as 
asterisks. The data values for the item responses and scores were returned as numeric codes. 
The multiple-choice, single response format items were assigned codes depending on the 
position of the response alternative; that is, the first choice was assigned the code "1," the 
second was assigned "2," and so forth. The mark-all-that-apply items were given as many data 
fields as response alternatives; the marked choices were coded as "1" and the unmarked choices 
as blanks. The images of constructed response items were saved as a digitized computer file. 
The area of the page that needed to be clipped was defined prior to scanning through the 
document definition process. The fields from unreadable pages were coded "X" as a flag for 
resolution staff to correct. 
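
The coding conventions in the preceding paragraph can be sketched as follows. This is an illustration under stated assumptions: marks are represented as a list of booleans in printed order, and the function names are hypothetical.

```python
BLANK = " "      # unmarked field
MULTIPLE = "*"   # more than one mark in a single-response field


def code_single_response(marks):
    """Code a multiple-choice, single-response item.

    marks: list of booleans, one per response alternative, in printed order.
    """
    marked = [position for position, is_marked in enumerate(marks, start=1) if is_marked]
    if not marked:
        return BLANK
    if len(marked) > 1:
        return MULTIPLE
    return str(marked[0])  # first choice -> "1", second -> "2", and so forth


def code_mark_all_that_apply(marks):
    """One data field per alternative: "1" if marked, blank otherwise."""
    return ["1" if is_marked else BLANK for is_marked in marks]


print(code_single_response([False, True, False, False]))   # second alternative marked
print(code_single_response([True, False, True, False]))    # multiple marks
print(code_mark_all_that_apply([True, False, True]))
```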

As the scanning program completed scanning each stack, the stack was removed from 
the output hopper and placed in the same order on the output cart. The next stack was 
removed from the cart, placed into the input hopper, and the scanning resumed. When the 
operator had completed processing the last stack of the batch, the program was terminated. 
This closed the dataset, which automatically became available for the edit process. The scanned 
documents were then forwarded to a holding area in case they needed to be retrieved for 
resolution of edit errors. 

An intelligent character recognition engine was used to read various hand and machine 
print on the front cover of the assessment and supervisor documents. Information from the 
Administration Schedule, the Rosters of Questionnaires, and some questions in the school 
questionnaire was read by the engine and verified by a key entry operator. Analysis by NCS 
development staff of the accuracy of characters read via intelligent character recognition 
determined that the recognition engine read as well as two people processing information using 
a key entry and 100 percent verify method of data input. In all, the intelligent character 
recognition engine read nearly 6 million characters for the 1994 NAEP Trial State Assessment. 
This saved NAEP field staff and school personnel a significant amount of time since they no 
longer had to enter this data by gridding rows and columns of data. 

Figure 5-3 

1994 NAEP Trial State Assessment 
Image Scanning Flow Chart 



To provide yet another quality check on the image scanning and scoring system, NCS 
staff implemented a quality check process by creating labels with a valid score designated on 
them. Each unique item scored via the image system had two quality control labels per valid 
score. These labels were attached to blank, unused booklets by clerical staff and sent through 
the scanning process. An example of the label used is given below. 

IMAGE SCORING 
QUALITY ASSURANCE 
SAMPLE 
SCORE - 

Although the quality control booklets were batched and processed separately from 
assessed student booklets, they were sent through the same process as the student documents. 
Since all responses to a specific item are batched together for transmission to the scoring facility, the 
labeled responses were integrated with and transmitted simultaneously to the scoring facility 
with the student responses. During the scoring process, both student responses and the quality 
control items were randomly displayed so scores could be applied. 

When a reader saw the quality control label on the monitor, he or she notified the team 
leader to watch and confirm the score while the reader assigned the score given on the label. 
The quality control booklets were included in the pool of all items to be drawn from for the 25 
percent reliability rescore. Analysis of the data captured from this quality assurance process 
showed 100 percent accuracy in the system software design for capturing scores assigned to 
constructed-response items and linking them back to the original student document. 

A key entry and verification process was used to make corrections to the teacher 
questionnaires and the IEP/LEP student questionnaires. The Falcon system that was used to 
enter this data is an online data entry system designed to replace most methods of data input 
such as keypunch, key-to-disk, and many of the microcomputer data entry systems. The 
terminal screens were uniquely designed for NAEP to facilitate operator speed and convenience. 
The fields to be entered were titled to reflect the actual source document. 



5.4.4 Data Validation 

NCS used the same format used in the 1992 assessment and the 1993 field test to set up 
the document definition files for the large numbers of unique documents used in the 1994 
assessment. To do the proper edits, a detailed document definition procedure was designed to 
allow NCS to define an item once and use it in many blocks and to define a block once and 
use it in many documents. The procedure used was a document file that pointed to the 
appropriate blocks on a block file that pointed to appropriate items on an item file. With this 
method of definition, a document was made up of blocks, which were made up of items. 

Each dataset produced by the scanning system contained data for a particular batch. 
These data had to be edited for type and range of response. The data entry and resolution 
system used was able to process a variety of materials from all age groups, subject areas, control 
documents, and questionnaires simultaneously, as the materials were submitted to the system 
from scannable and nonscannable media. 

The data records in the scan file were organized in the same order in which the paper 
materials were processed by the scanner. A record for each batch header preceded all data 
records for that batch. The document code field on each record distinguished the header record 
from the data records. 

When a batch header record was read, a pre-edit data file or an edit log was generated. 
As the program processed each record within a batch from the scan file, it wrote the edited and 
reformatted data records to the pre-edit data file and/or recorded all errors on the edit log. 

The data fields on an edit log record identified each data problem by the batch sequence 
number, booklet serial number, section or block code, field name or item number, and data 
value. After each batch had been processed, the program generated a listing or online edit file 
of the data problems and resolution guidelines. An edit log listing was printed at the 
termination of the program for all non-image documents and image "clips" were routed to online 
editing stations for those documents that were image-scanned. 

As the program processed each data record, it first read the booklet number and 
checked it against the session code for appropriate session type. Any mismatch was recorded on 
the error log and processing continued. The booklet number was then compared against the 
first three digits of the student identification number. If they disagreed, a message was written 
to the error log. The remaining booklet cover fields were read and validated for the correct 
range of values. The school codes had to be identical to those on the process control system 
record. All data values that were out of range were read "as is" but flagged as suspect. All data 
fields that were read as asterisks were recorded on the edit log or online edit file. 

Document definition files described each document as a series of blocks, which in turn 
were described as a series of items. The blocks in a document were transcribed in the order 
that they appeared in the document. Each block’s fields were validated during this process. If a 
document contained suspect (out-of-range) data, the cover information was recorded on the edit 
log, along with a description of the suspect data. The edited booklet cover was transferred to an 
output buffer area within the program. As the program processed each block of data from the 
dataset record, it appended the edited data fields to the data already in this buffer. 

The program then cycled through the data area corresponding to the item blocks. The 
task of translating, validating, and reporting errors for each data field in each block was 
performed by a routine that required only the block identification code and the string of input 
data. This routine had access to a block definition file that had, for each block, the number of 
fields to be processed, and, for each field, the field type (alphabetic or numeric), the field width 
in the data record, and the valid range of values. The routine then processed each field in 
sequence order, performing the necessary translation, validation, and reporting tasks. 
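
The document, block, and field definitions and the block-level validation routine described above can be sketched as follows. This is a minimal illustration, not the NCS implementation: the `FieldDef` structure, the block and document codes, the use of a set of valid values to stand in for the valid range, and the edit-log tuple layout are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDef:
    name: str
    kind: str            # "numeric" or "alphabetic"
    width: int           # field width in the data record
    valid: frozenset     # valid values for the field (stands in for the valid range)
    critical: bool = False


# Hypothetical block file: block code -> ordered field definitions.
BLOCKS = {
    "RA": [FieldDef("item1", "numeric", 1, frozenset("1234")),
           FieldDef("item2", "numeric", 1, frozenset("1234"))],
    "RB": [FieldDef("item3", "numeric", 1, frozenset("123"), critical=True)],
}

# Hypothetical document file: a block is defined once and reused across documents.
DOCUMENTS = {"BOOK1": ["RA", "RB"], "BOOK2": ["RB", "RA"]}


def validate_block(block_code, data):
    """Validate one block's input string; return (field values, edit-log entries)."""
    values, log, pos = [], [], 0
    for field in BLOCKS[block_code]:
        raw = data[pos:pos + field.width]
        pos += field.width
        values.append(raw)
        if raw.strip() == "":
            if field.critical:
                log.append((block_code, field.name, raw, "blank critical field"))
            continue  # a blank in a noncritical field is treated as a nonresponse
        if "*" in raw:
            log.append((block_code, field.name, raw, "multiple marks"))
        elif field.kind == "numeric" and not raw.isdigit():
            log.append((block_code, field.name, raw, "wrong type"))
        elif raw not in field.valid:
            log.append((block_code, field.name, raw, "out of range"))
    return values, log


def validate_document(doc_code, record):
    """Walk the document's blocks in order, appending edited fields to a buffer."""
    buffer, log, pos = [], [], 0
    for block_code in DOCUMENTS[doc_code]:
        width = sum(f.width for f in BLOCKS[block_code])
        values, entries = validate_block(block_code, record[pos:pos + width])
        pos += width
        buffer.extend(values)
        log.extend(entries)
    return "".join(buffer), log


print(validate_document("BOOK1", "2 3"))   # clean record with one nonresponse
print(validate_document("BOOK2", " 7*"))   # three edit-log entries
```

As in the text, out-of-range values are kept "as is" in the output buffer and only flagged in the log, so resolution staff can review them later.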

The first of these tasks checked for the presence of blanks or asterisks in a critical field. 
These were recorded on the edit log or online edit file and processing continued with the next 
field. No action was taken on a blank field for multiple-choice items inasmuch as that code 
indicated a nonresponse. The field was validated for range of response, and any values outside 
of the specified range were recorded on the edit log or online edit file. The program used the item type 
code to make a further distinction among constructed-response item scores and other numeric 
data fields. 

Moving the translated and edited data field into the output buffer was the last task 
performed in this phase of processing. 

When the entire document had been processed, the completed string of data was written 
to the data file. When the program encountered the end of a file, it closed the dataset and 
generated an edit listing for non-image and key-entered documents. Image scanned items which 
required correction were displayed on an online editing terminal. 

Accuracy checks were performed on each non-image batch processed. The record of 
every 500th document of each booklet/document type was printed in its entirety, with a 
minimum of one document type per batch. This record was checked, item by item, against the 
source document. If inconsistencies were found, project personnel were contacted and 
processing stopped. 
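
One possible reading of the sampling rule above is sketched below. The interpretation is an assumption: the counter is kept per document type within a batch, and the "minimum of one" clause is read as selecting at least one record from every batch. The function name and the record layout are hypothetical.

```python
from collections import Counter


def select_qc_records(batch, interval=500):
    """Select records for the item-by-item accuracy check.

    batch: list of (document_type, record_id) tuples in scan order.
    """
    seen = Counter()
    selected = []
    for document_type, record_id in batch:
        seen[document_type] += 1
        if seen[document_type] % interval == 0:
            selected.append(record_id)
    if not selected and batch:
        # Assumed fallback: every batch contributes at least one record.
        selected.append(batch[0][1])
    return selected


small_batch = [("booklet", 1), ("booklet", 2), ("roster", 3)]
print(select_qc_records(small_batch))  # fewer than 500 documents: first record is chosen
```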



5.4.5 Editing for Non-image and Key-entered Documents 

Throughout the system, quality procedures and software ensured that the NAEP data 
were correct. The machine edits performed during data capture verified that each sheet of each 
document was present and that each field had an appropriate value. All batches entered into 
the system, whether key entered or machine scanned, were checked for completeness. 

Data editing took place after these checks. This consisted of a computerized edit review 
of each respondent’s record and the clerical edits necessary to make corrections based upon the 
computer edit. This data editing step was repeated until all data fell within a valid range. 

The first phase of data editing was designed to validate the population and ensure that 
all documents were present. A computerized edit list, produced after NAEP documents were 
scanned or key entered, and all the supporting documentation sent from the field were used to 
perform the edit function. The hard copy edit list contained all the vital statistics about the 
batch. The number of students, school code, type of document, assessment code, error rates, 
suspect cases, and record serial numbers were among these elements. Using these inputs, the 
data editor verified that the batch had been assembled correctly and each school number was 
correct. 



During data entry, counts of processed documents were generated by type. These counts 
were balanced against the information captured from the administration schedules. The number 
of assessed and absent students processed had to match the numbers indicated on the process 
control system. 

In the second phase of data editing, an experienced editing staff used a predetermined 
set of specifications to review the field errors and record any necessary correction to the student 
data file. The same computerized edit list used in the first phase was used to perform this 
function. The process was as follows: 






The editing staff reviewed the edit log prepared by the computer and the area of the 
source document that was noted as being "suspect" or containing possible errors. The current 
composition of the field was shown in the edit box. The editing staff checked this piece of 
information against the NAEP source document. At that point, one of the following took place: 

Correctable error: If the error was correctable by the editing staff according to the 
editing specifications, the corrections were noted on the edit log. 

Alert: If an error was not correctable according to the specifications, an alert was 
issued to the operations coordinator for resolution. Once the correct information 
was obtained, the correction was noted on the edit log. 

Noncorrectable error: If a suspected error was found to be correct as stated and no 
alteration was possible according to the source document and specifications, the 
programs were tailored to allow this information to be accepted into the data record 
and no corrective action was taken. 

The corrected edit log was then forwarded to the key entry staff for processing. When 
all corrections were entered and verified for a batch, an extract program pulled the corrected 
records into a mainframe dataset. At this point, the mainframe edit program was initiated. The 
edit criteria were again applied to all records. If there were further errors, a new edit listing 
was printed and the cycle began again. 

When the edit process had produced an error-free file, the booklet ID number was 
posted to the NAEP tracking file by age, assessment, and school. This permitted NCS staff to 
monitor the NAEP processing effort by accurately measuring the number of documents 
processed by form. The posting of booklet IDs also ensured that a booklet ID was not 
processed more than once. 



5.4.6 Data Validation and Editing of Image-processed Documents 

The paper edit log was replaced by online viewing of suspect data for all image- 
processed documents. The edit criteria for each item or items in question also appeared on the 
screen at the same time the suspect item was displayed for rapid resolution. Corrections were 
made at this time. The system employed an edit/verify system which ultimately enabled two 
different online-edit operators to view the same suspect data and work on it separately. The 
"verifier" had to make sure that the two responses (one from either the "entry" operator or the 
intelligent character recognition engine) were the same before the system would accept that item 
as being corrected. The verifier was able to overrule or agree with the original correction made 
if the two were discrepant. If the editor was unable to determine the appropriate response, he 
or she escalated the suspect situation to a supervisor. 

When an entire batch was through the edit phase, it was then eligible for the count 
verification phase. The administration schedule data were examined systematically for booklet 
IDs that should have been processed (assessed, absent, and excluded administration codes). The 
documents under an individual administration schedule were then inspected to ensure that all of 
the booklet IDs listed on it were present. 




With the satisfactory conclusion of the count verification phase, the edited batch file was 
uploaded to the mainframe where it went through yet another edit process. A paper edit log 
was then produced, and, if errors remained, the paper edit log was forwarded to another editor. 
When this edit was satisfied, the appropriate tracking mechanisms (the process control and work 
flow management systems) were updated. 



5.4.7 Data Transmission 

Due to the rapid pace of scoring on an item-by-item basis, the NCS scoring specialists 
found it necessary to continually monitor the status of work available to the readers and plan 
the scoring schedule several weeks in advance. On Wednesday of each week, the NCS scoring 
specialist planned the schedule for the next two weeks. That information was then provided to 
the person in charge of downloading data to the scoring center. By planning the scoring 
schedule two weeks in advance, the scoring specialists were able to ensure that readers would 
have sufficient work for at least one week, after which the next download would occur to 
supplement the volume of any unscored items and add an additional week’s work to the pool of 
items to score. Additionally, by scheduling two weeks’ data for transmission, flexibility was 
added to the scoring schedule, making it possible to implement last minute changes in the 
schedule once the items had been delivered to the scoring center. Depending on the number of 
items to be transmitted, the actual scheduling was conducted on Friday or divided into two 
smaller sessions on Thursday and Friday. 

Delivery of data to the scoring center — ^located approximately five miles from NCS’s 
main facility in Iowa City— -was accomplished via several T1 transmission lines linking the 
mainframe computers and the NAEP servers at the site of document scanning in the main 
facility, with the scoring servers dedicated to distributing work to the professional readers at the 
scoring center. The actual task of scheduling items for downloading was accomplished using 
code written by the image software development team. This code enabled the person scheduling 
the downlead to choose a team of readers and select the scheduled items from a list of all items 
that team would be scoring throughout the scoring project. This process was repeated for all 
teams of readers until all anticipated work was scheduled. Once this task was completed, the 
scheduled job was tested to determine if sufficient free disk space existed on the servers at the 
scoring center. If, for any reason, sufficient disk space was not available, scheduled items could 
be deleted from the batch individually or as a group until the scheduled batch job could 
accommodate all items on the available disk space at the scoring center. Once it was
determined that there was sufficient disk space, transmission of student responses commenced. 
Data transmission was typically accomplished during off-shift hours to minimize the impact on 
the system’s load capacity. 
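
The disk-space check described above can be sketched as follows. This is a minimal illustration, not the NCS scheduling code: the `ScheduledItem` class, the item identifiers, the sizes, and the rule of dropping the last-scheduled items first are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ScheduledItem:
    item_id: str   # hypothetical item identifier
    size_mb: int   # assumed size of the item's response images


def fit_batch_to_disk(items, free_disk_mb):
    """Return (kept, dropped) so that the kept items fit in the free disk space.

    Items are removed from the end of the schedule first. That ordering is an
    assumption; the report only says items could be deleted individually or
    as a group until the batch fit.
    """
    kept = list(items)
    dropped = []
    while kept and sum(item.size_mb for item in kept) > free_disk_mb:
        dropped.append(kept.pop())
    return kept, dropped


batch = [
    ScheduledItem("R1", 400),
    ScheduledItem("R2", 350),
    ScheduledItem("R3", 300),
]
kept, dropped = fit_batch_to_disk(batch, free_disk_mb=800)
print([item.item_id for item in kept])     # items transmitted in this download
print([item.item_id for item in dropped])  # items deferred to a later download
```

Transmission would begin only once the kept batch fits, which mirrors the order of operations in the text.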



5.5 PROFESSIONAL SCORING 
5.5.1 Overview 

Scoring of the 1994 NAEP Trial State Assessment constructed-response items was 
conducted using NCS’s image technology. All 1994 responses were scored online by readers
working at image stations. The logistical problems associated with handling large quantities of
student booklets were removed for those items scored on the image system. 

One of the greatest advantages image technology presented for NAEP scoring was in the 
area of sorting and distributing work to scorers. All student responses for a particular item, 
regardless of where spiraling had placed that item in the various booklet forms, were grouped 
together for presentation to a team of readers. This allowed training to be conducted one item 
at a time, rather than in blocks of related items, thus focusing readers’ attention on the 
complexities of a single item. 

A number of tools built into the system allowed table leaders and trainers to closely and 
continuously monitor reader performance. A detailed discussion of these tools can be found 
later in this chapter. 

The system automatically routed 25 percent of student responses to other members of 
the team for second scoring. Readers were given no indication of whether the response had 
been scored by another reader, thereby making the second scoring truly blind. On-demand, 
real-time reports on interreader reliability (drawn from those items that were second-scored) 
presented extremely valuable information on team and individual scoring. Information on
adjacent and perfect agreement, score distribution, and quantity of responses scored was
continuously available for consultation. Similarly, back-reading of student responses could be
accomplished in an efficient and timely manner. Table leaders were able to read a large 
percentage of responses, evaluating the appropriateness and accuracy of the scores assigned by 
readers on their teams. 
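
One way to picture the blind second-scoring step is the sketch below. It is an assumption-laden illustration rather than the image system's actual routing logic: the function name, the reader identifiers, the fixed random seed, and the use of simple random sampling are all hypothetical. The only facts taken from the text are the 25 percent rate and the rule that the second score comes from another member of the team.

```python
import math
import random


def route_for_second_scoring(first_reader, team, rate=0.25, seed=0):
    """Choose responses for second scoring and assign each a different reader.

    first_reader maps a response id to the reader who gave the first score.
    Returns a dict mapping each selected response id to its second reader.
    """
    rng = random.Random(seed)
    response_ids = sorted(first_reader)
    # Round up so that at least `rate` of the responses are second-scored.
    n_second = math.ceil(len(response_ids) * rate)
    chosen = rng.sample(response_ids, n_second)
    second_reader = {}
    for rid in chosen:
        # The second reader must not be the reader who gave the first score.
        others = [reader for reader in team if reader != first_reader[rid]]
        second_reader[rid] = rng.choice(others)
    return second_reader


team = ["reader_a", "reader_b", "reader_c"]
first_reader = {rid: team[rid % len(team)] for rid in range(40)}
second_reader = route_for_second_scoring(first_reader, team)
print(len(second_reader))  # 10 of the 40 responses are second-scored
```

Because readers see every response in the same way, nothing in the returned assignment needs to be shown to them; it only determines which responses feed the reliability reports.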

Project management tools assisted table leaders in making well-informed decisions. For 
example, knowledge of the precise number of responses remaining to be scored for a particular 
item allowed table leaders to determine the least disruptive times for lunch breaks. 

Concerns about possible reader fatigue or other problems that might result from working 
continuously at a computer terminal proved unfounded. Both readers and table leaders 
responded with enthusiasm to the system, remarking on the ease with which student responses 
could be read and on the increased sense of professionalism they felt in working in this 
technological environment. Readers took periodic breaks, in addition to their lunch break, to 
reduce the degree of visual fatigue. Readers were grouped in teams of 6 to 10 readers per 
team. Individual rooms were set up with each room containing teams for a single subject area. 



5.5.2 Training Paper Selection

A pool of papers to be used during training for the national main assessment was 
selected by NCS staff in February 1994. During the interview process, NCS scoring specialists 
identified those candidates with team leader potential. Individuals recruited to be team leaders 
during the actual scoring were asked to select student responses to send to ETS test 
development specialists, who created the master training set. Team leaders were used for this 
task because it gave them the advantages of working on specific items, learning the make-up of 
the various booklets, learning the terminology, and understanding the processing of the booklets
at NCS. This was especially important in 1994, because most scoring activities occurred via the
image processing system. 

The training set for each short (two- or three-point) item included 40 papers: 

• 10 anchor papers 

• 20 practice papers

• 5 papers in calibration set #1 

• 5 papers in calibration set #2 

The training set for each extended (four-point) item included 85 papers: 

• 15 anchor papers 

• 40 practice papers 

• 10 papers in each of two qualification sets 

• 5 papers in calibration set #1

• 5 papers in calibration set #2 

To ensure that the ETS test development specialist would have a wide range of student 
responses to encompass all score points, NCS personnel copied approximately 100 papers for 
each two- or three-point item and 200 papers for each four-point item. To ensure that training 
papers represented the range of responses obtained from the sample population, NCS personnel 
selected papers randomly from across the sample. The student identifier (barcode) was written 
on the copy. The responses were numbered sequentially, copied, and sent via overnight delivery
to ETS. When the training packet was compiled, the ETS test development specialist faxed the
composition of the packets back using the sequential numbers. ETS staff kept its copy of the 
training sets. A total of 4,100 student responses were forwarded to ETS to be used in the 
creation of training packets. 

From the faxed sheets, packets were created for each item using the first generation 
copy. These packets were then forwarded to the NCS communication center for copying, and 
stored for the team’s use in training. ETS also sent the most up-to-date version of the scoring 
guide for each item to be included in the scoring guide. 



5.5.3 General Training Guidelines

ETS personnel conducted training for the constructed-response items on an item-by-item 
basis, so that each item could be scored immediately after training. Reading items tied to a 
common stimulus were trained and scored sequentially, finishing one block before proceeding to 
the next. 

In all, 13 team leaders and 120 readers worked from March 28 to May 27, 1994 to 
complete scoring for the 1994 NAEP Trial State Assessment. Each member of a team received 
a copy of the stimulus and training materials for the items which his or her team would be 
scoring. Before training, each team member read the stimulus and discussed it under the 
guidance of the trainer where applicable. Next, ETS staff conducted training sessions to explain 
the anchor papers, exemplifying the various score point levels. The team proceeded with each
member scoring the practice papers, and then discussing those papers as a group while the
trainer clarified issues and answered questions. The papers selected for each training set were 
chosen to illustrate a range from easily classifiable responses to borderline responses for each 
score point. 

When the trainer was confident the readers were ready to begin scoring short 
constructed responses, the table leader signaled the system to release the responses to the team 
members who had successfully completed training. For extended constructed-response items, 
each team member was given a qualifying set which had been prescored by the trainer in 
conjunction with the table leader. Readers were required to score an exact match on 80 percent 
of the items in order to qualify for scoring. If a reader failed on the first attempt, the trainer 
discussed the discrepant scores with the reader and administered a second qualifying set. Again, 
80 percent exact agreement was required to score the item. During the beginning stages of 
scoring, the team members discussed student responses with the trainer and table leader to 
ensure that issues not addressed in training were handled in the same manner by all team 
members. 

After the initial training, readers scored the items, addressing questions to the table 
leader and/or trainer when appropriate. Depending upon n-counts, length of responses and 
complexity of the rubric, scoring of an individual item ranged anywhere from one-half hour to
two weeks. Whenever a break longer than 15 minutes occurred in scoring, each team member 
received a set of calibration papers which had been prescored by the trainer and table leader. 
Each team member scored the calibration set individually, and then the team discussed the 
papers to ensure against scorer drift. 



5.5.4 Table Leader Utilities and Reliability Reports

Among the many advantages of the image scoring system is the ease with which work 
flow to readers can be regulated and scoring can be monitored. One of the utilities at a table 
leader’s disposal was a qualification algorithm executed upon completion of training on an 
extended constructed-response item. At that time, a table leader passed out a qualification 
packet of 10 papers whose scores had been entered as a master key on the table leader’s 
workstation. Upon completion of the packet, the table leader entered each reader’s scores into 
the computer for tabulation and the computer calculated each reader’s percent of exact, 
adjacent, and nonadjacent agreement with the master key. If a reader had a percent of exact 
agreement above a predetermined threshold, the reader was authorized to begin scoring that
item. Readers not reaching the predetermined threshold were handled on a case-by-case basis, 
typically receiving individual training by the ETS trainer or the NCS table leader before being 
allowed to begin scoring. A table leader also had the authority to cancel a reader’s qualification 
to score an item if review of a reader’s work indicated the reader was scoring inaccurately. 
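
The tabulation performed by the qualification utility can be illustrated with a short sketch. The function names and the sample scores are hypothetical, and the 80 percent threshold is borrowed from the training guidelines in section 5.5.3 as an assumption about what the "predetermined threshold" was.

```python
def agreement_with_key(reader_scores, master_key):
    """Percent exact, adjacent, and nonadjacent agreement with a master key."""
    if len(reader_scores) != len(master_key):
        raise ValueError("reader scores and master key must have equal length")
    n = len(master_key)
    pairs = list(zip(reader_scores, master_key))
    exact = sum(1 for r, k in pairs if r == k)
    # Adjacent agreement: the reader is off by exactly one score point.
    adjacent = sum(1 for r, k in pairs if abs(r - k) == 1)
    nonadjacent = n - exact - adjacent
    return {
        "exact": 100.0 * exact / n,
        "adjacent": 100.0 * adjacent / n,
        "nonadjacent": 100.0 * nonadjacent / n,
    }


def qualifies(reader_scores, master_key, threshold=80.0):
    """True if exact agreement meets the (assumed) qualification threshold."""
    return agreement_with_key(reader_scores, master_key)["exact"] >= threshold


# A hypothetical 10-paper qualification packet for a four-point item.
master_key = [1, 2, 3, 4, 2, 3, 1, 4, 3, 2]
reader = [1, 2, 3, 4, 2, 3, 1, 3, 3, 4]
stats = agreement_with_key(reader, master_key)
print(stats)
print(qualifies(reader, master_key))
```

In this example the reader matches the key on 8 of 10 papers, is one point off on one paper, and is two points off on another, so the reader would just meet an 80 percent exact-agreement requirement.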

After scoring commenced, review of each reader’s progress was conducted using a back- 
reading utility that allowed a table leader to review every paper scored by each reader on the 
table. Typically a table leader would choose the ID number of a reader and review a minimum 
of 10 percent of the responses scored by that reader, making certain to note the score the 
reader awarded each response as well as the score a second reader gave that same paper as an 
interreader reliability check. Alternately, a table leader could select to review all responses
receiving a specific score in order to determine if the whole team was scoring consistently. Both
review methods utilized the same display screen and revealed the ID number of the reader and 
the score awarded. If the table leader disagreed with the score given a response, the table 
leader would discuss the discrepancy with the reader and possibly replace the score of the 
questionable response. Scores were replaced by the table leader only when the scorer had made 
an obvious error. The main purpose of this monitoring was to provide early identification of 
problems and opportunities to retrain scorers when needed. 

A minimum of 25 percent of the 1994 reading responses were scored twice. The image 
system presented all responses in the same manner, so the reader could not discern which 
responses were being first-scored and which were designated for a second scoring. The table 
leader and the ETS trainer were able to monitor these figures on demand. The system showed 
the overall reliability for the group scoring the item and individual reliability of the qualified 
readers. 

During the scoring of an item, the table leader could monitor progress using an 
interreader reliability tool. This display tool could be used in either of two modes — to display 
information of first readings versus second readings, or to display first reading of an individual 
versus second readings of that individual. 

The table leaders were able to monitor work flow using a status tool that displayed the 
number of items completed, the number of items that still needed second scoring, and the 
number of items that had not been scored up to that time. 

Table 5-5 shows the number of constructed-response items falling into each range of 
percentages of exact agreement. Tables 9-2 and 9-3 in Chapter 9 show more reliability 
information about the constructed-response items used in the NAEP scale. 



Table 5-5

1994 NAEP Trial State Assessment
Number of Constructed-response Items
in Each Range of Percentages of Exact Agreement Between Readers

                                        Number of
Grade 4 Reading Items                   Unique Items   60-69%   70-79%   80-89%   90-100%

Short constructed-response items             37           0        0        8        29
Extended constructed-response items           8           0        1        6         1



5.5.5 Main and Trial State Reading Assessment

It is important to note that the student responses in the fourth-grade reading 
assessments were scored concurrently for the national and the state samples. Another 
advantage of image-based item-by-item scoring is that the comparability of the scoring of the 
two samples is ensured since all responses are scored simultaneously and in a manner which
makes it impossible for the scorers to know from which sample any individual response comes.
Because of this, the following discussion addresses both national (main) and state reading. 



5.5.6 Training for the Main and State Reading Assessment

The reading assessment followed the basic training procedures outlined in section 5.5.4.
One trainer provided all the training for the fourth-grade items scored. One trainer followed 
the fourth-grade items through from beginning to end. 



5.5.7 Scoring the Main and State Reading Assessment

Each constructed-response item had a unique scoring standard that identified the range 
of possible scores for the item and defined the criteria to be used in evaluating the students’ 
responses. Point values were assigned with the following meanings:

Dichotomous items from the 1992 assessment 

• 1 = Unacceptable 

• 4 = Acceptable 

Dichotomous items developed during the 1993 field test 

• 1 = Evidence of little or no comprehension 

• 3 = Evidence of full comprehension 

Three-point items developed during the 1993 field test 

• 1 = Evidence of little or no comprehension 

• 2 = Evidence of partial or surface comprehension 

• 3 = Evidence of full comprehension 

All four-point items 

• 1 = Evidence of unsatisfactory comprehension 

• 2 = Evidence of partial comprehension 

• 3 = Evidence of essential comprehension 

• 4 = Evidence of extensive comprehension 

The scores for these items also included a 0 for no response, 8 for an erased or crossed- 
out response, and a 9 for any response found to be unratable (i.e., illegible, off-task, responses 
written in a language other than English, or responses of "I don’t know").
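
The score codes listed above lend themselves to a simple lookup with range checking. The sketch below is illustrative only: the item-type keys (`dichotomous_1992` and so on) and the function name are hypothetical labels, while the score values and meanings are taken from the lists above.

```python
# Substantive score points for each kind of item, as listed above.
SCORE_POINTS = {
    "dichotomous_1992": {1: "Unacceptable", 4: "Acceptable"},
    "dichotomous_1993": {
        1: "Evidence of little or no comprehension",
        3: "Evidence of full comprehension",
    },
    "three_point": {
        1: "Evidence of little or no comprehension",
        2: "Evidence of partial or surface comprehension",
        3: "Evidence of full comprehension",
    },
    "four_point": {
        1: "Evidence of unsatisfactory comprehension",
        2: "Evidence of partial comprehension",
        3: "Evidence of essential comprehension",
        4: "Evidence of extensive comprehension",
    },
}

# Codes that apply to every item type.
SPECIAL_CODES = {
    0: "No response",
    8: "Erased or crossed-out response",
    9: "Unratable response",
}


def describe_score(item_type, score):
    """Return the meaning of a score, or raise ValueError if it is out of range."""
    if score in SPECIAL_CODES:
        return SPECIAL_CODES[score]
    valid = SCORE_POINTS[item_type]
    if score not in valid:
        raise ValueError(f"score {score} is out of range for {item_type}")
    return valid[score]


print(describe_score("dichotomous_1992", 4))
print(describe_score("four_point", 9))
```

A check of this kind is what allows an out-of-range score, such as a 4 on a three-point item, to be rejected at entry rather than discovered later.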

During scoring, the table leaders compiled notes on various responses for the readers’ 
reference and guidance and for the permanent record. In addition, trainers were accessible for 
consultation in interpreting the guides for unusual or unanticipated responses. The table leaders 
conducted constant online back-reading of all team members’ work throughout the scoring
process, bringing to the attention of each reader any problems relating to scoring. When
deemed appropriate, scoring issues were discussed among the team as a whole. Table leaders 
also monitored n-counts of responses scored and individual and team reliability figures 
throughout the course of scoring. 

Each item was scored by a single team immediately after training for that item. Team 
sizes averaged 10 scorers. 

Grade 4 items came from both a national and a state-by-state sample. Responses were 
delivered by image in such a way that the student demographics were unknown to the reader. 
Thus, readers did not know from which sample any given item came when it appeared on the 
screen. In the case of overlap items, all readers scored responses at both grade levels. 

5.5.8 1992 Short-Term Trend and Image/Paper Special Study

Sixteen blocks from the 1992 reading assessment were re-used in the 1994 assessment to
provide data with which to study trends over time. To accomplish this, a random sample of
responses from the 1992 assessment was pulled from the warehouse for rescoring to determine
whether or not the scoring performed in 1994 was comparable to the scoring performed in 1992.
For the national sample, ETS measurement personnel identified three booklets at grade 4, four
booklets at grade 8, and five booklets at grade 12 which contained all of the blocks needed for
the study. The entire sample of those booklets was used for the rescore study. Since each block
appears in four booklets, rescoring the entire sample of one booklet resulted in a 25 percent 
rescore of the responses from 1992. For the state sample, 12 booklets were pulled for each 
unique booklet type (R30 through R45) for each of the 41 jurisdictions which participated in 
both the 1992 and 1994 Trial State assessments. Since each block appeared in four different 
booklet types, 48 responses to each item were rescored for each jurisdiction. These booklets 
were scanned to capture the same clip areas used for the 1994 responses. Thus they appeared 
identical to the reader when viewing them on the monitor and were presented at the same time 
as the 1994 responses. 
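
The rescore sample sizes quoted above follow from simple arithmetic, worked through in the sketch below. The variable names are hypothetical. The last two quantities are not stated in the report; they are derived here on the assumption that every jurisdiction supplied the full 12 booklets for each of the 16 booklet types (R30 through R45).

```python
BOOKLET_TYPES_PER_BLOCK = 4    # each block appears in four booklet types
BOOKLETS_PULLED_PER_TYPE = 12  # state sample: booklets pulled per type, per jurisdiction
BOOKLET_TYPES = 16             # R30 through R45
JURISDICTIONS = 41             # participated in both 1992 and 1994

# National sample: rescoring all of one booklet covers one of the four
# booklets in which a block appears.
national_rescore_fraction = 1 / BOOKLET_TYPES_PER_BLOCK

# State sample: 12 booklets from each of the four types containing an item.
state_responses_per_item = BOOKLETS_PULLED_PER_TYPE * BOOKLET_TYPES_PER_BLOCK

# Derived under the stated assumption, not reported in the text.
booklets_per_jurisdiction = BOOKLETS_PULLED_PER_TYPE * BOOKLET_TYPES
booklets_all_jurisdictions = booklets_per_jurisdiction * JURISDICTIONS

print(national_rescore_fraction)   # 0.25, the 25 percent rescore
print(state_responses_per_item)    # 48 responses per item per jurisdiction
print(booklets_per_jurisdiction)   # 192
print(booklets_all_jurisdictions)  # 7872
```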

After scanning was completed, the national sample of the rescore booklets was 
transported to the scoring facilities to be scored on paper. Paper scoring took place at the same 
time as image scoring. This process yielded data to compare the paper-based scoring done in 
1992, the paper-based scoring done in 1994, and the image-based scoring done in 1994. 

Analyses performed on these data will be documented in The NAEP 1994 Technical Report. 



5.5.9 Calibration Bridges

Unanticipated delays in receipt and processing of student booklets resulted in a situation 
in which scoring for some constructed-response items began before all or most of the student 
responses for those items were available for scoring. The result was that the responses for most 
of the 1994 constructed-response items were scored in two different scoring sessions ("sweeps"). 
To maintain the highest standards of scoring and measurement precision and to ensure that 
calibration error was not introduced as a result of the split scoring sessions and the time elapsed 
between them, a plan was devised to calibrate the scoring of sweeps 1 and 2. In some instances,
it was determined that scoring could resume with a review of training and a regular calibration
set to ensure consistency and reliability. In other instances, a calibration bridge was constructed 
to provide statistical linkage between the two scoring sessions. 

It was determined that scoring could continue without the calibration bridge in those 
instances in which completed scoring had met two criteria: 50 percent scored on the first sweep 
and interreader reliability equal to or greater than 95 percent. 

For those items not meeting these criteria, a set of papers was scored to provide a 
reliability link or calibration bridge between completed scoring ("first sweep") and subsequent 
scoring ("second sweep"). The procedures followed for completing the calibration bridge were as 
follows: 

1. Approximately 12,500 processed booklets were pulled from inventory and, from 
them, samples of student responses were constructed for each item designated for 
the calibration bridge scoring. 

2. A file of all pulled booklet ID numbers was created along with all scores assigned 
in the first sweep of scoring. This allowed for matching scores assigned in the 
first sweep to those given in the paper-based calibration bridge rescoring. 

3. For each designated item, each scorer read and scored at least 10 student 
responses drawn from this sample (10 different papers for each scorer). No 0 
score papers were included, and 20 percent of the responses were scored twice
for interreader reliability. 

4. The clerical support staff entered the scores in a spreadsheet program which 
produced data on reader agreement, score distributions, mean scores, and 
standard deviations of the mean scores. 

5. The data from the calibration bridge scoring was compared to the data for the 
first sweep scoring on the same item. 

6. After reviewing these data, items meeting the following criteria 
were determined to be ones for which second sweep scoring could 
then proceed: 

• items for which the Diff T test was not significant (p > .05), i.e., the null
hypothesis cannot be rejected.

• items for which the bridge/sweep percent agreement was higher than the 
designated threshold of 90 percent reliability for two-point items, 80 
percent reliability for three-point items, and 75 percent reliability for
four-point items.

• items for which the bridge interreader reliability was no more than six 
percentage points lower than first sweep interreader reliability. 







7. For those items not meeting the criteria, readers were retrained. Following the 
retraining, five different papers from the sample were read by each reader and 
results evaluated. 

8. Following analysis of the results of scoring following this retraining, a decision 
was made to continue scoring or, alternately, to rescore all the previously scored 
responses along with the remaining responses for the item under consideration. A
total of 95 calibration bridges were conducted in reading. 
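
The decision rules in this section can be summarized in a short sketch. It rests on several assumptions that the report does not spell out: that "50 percent scored" means at least 50 percent, that all three criteria in step 6 had to hold together, and that the t test result is available as a p-value. The class, field, and function names are hypothetical.

```python
from dataclasses import dataclass

# Bridge/sweep agreement thresholds (percent) by number of score points.
AGREEMENT_THRESHOLD = {2: 90.0, 3: 80.0, 4: 75.0}


def needs_bridge(percent_scored_first_sweep, first_sweep_reliability):
    """A bridge is skipped only when both first-sweep criteria are met."""
    met = percent_scored_first_sweep >= 50.0 and first_sweep_reliability >= 95.0
    return not met


@dataclass
class BridgeResult:
    score_points: int               # 2, 3, or 4
    t_test_p_value: float           # p-value of the difference t test
    bridge_sweep_agreement: float   # percent agreement, bridge vs. first sweep
    bridge_reliability: float       # interreader reliability in the bridge
    first_sweep_reliability: float  # interreader reliability in the first sweep


def may_proceed(result, alpha=0.05, max_drop=6.0):
    """True if second sweep scoring can proceed without retraining."""
    not_significant = result.t_test_p_value > alpha
    threshold = AGREEMENT_THRESHOLD[result.score_points]
    agreement_ok = result.bridge_sweep_agreement > threshold
    drop = result.first_sweep_reliability - result.bridge_reliability
    reliability_ok = drop <= max_drop
    return not_significant and agreement_ok and reliability_ok


ok = may_proceed(BridgeResult(3, 0.40, 84.0, 88.0, 91.0))
not_ok = may_proceed(BridgeResult(4, 0.03, 80.0, 85.0, 86.0))
print(ok, not_ok)
```

In the second example the agreement and reliability checks pass, but the significant t test alone would send the item to retraining under the all-criteria assumption.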



5.5.10 The Performance Scoring Center

The performance scoring center uses a desktop scanner interfaced with a PC for 
collecting score data. The software, scanner, and performance center scoring sheet used for 
NAEP were all developed by NCS. This scoring system is designed to add efficiency and 
portability to traditional paper-based scoring projects. The scoring system software is 
customized to NAEP’s needs including all items and valid score ranges. The demographic 
information, batch, sequence, and barcode numbers are pre-slugged onto the performance center 
scoring sheet obtained from the clean-post file after the editing process. These score sheets are 
then delivered to the scoring center for use when scoring the student documents.

The performance scoring center system offers unique attributes that are ideal for paper- 
based scoring projects. One advantage the system offers is the capability of scanning scoring 
sheets in random order. This provides the means for continuous scanning if a scoring sheet is 
rejected with an error (e.g., score out of range). Another advantage is the ability to produce 
inter-reader reliability information upon request. The reliability reports record the
total number of second scores for each item, the totals for agreement and disagreement,
and the percentages of agreement. Reports can be produced, on request, on an item
basis for a particular team or an individual reader. This enables us to ensure the validity of 
scoring by item, reader and team. Additionally, the performance scoring center system has the 
unique ability to produce reports that indicate the number of sheets left to scan by project, batch,
and sheet. This guarantees scoring sheet accountability and assures that a score is assigned to
every student response. 

The 1994 national and state assessments had some components that were not conducive 
to image scoring but were ideal for scoring using the performance scoring center system, 
including the NAEP Packet (1992 rescore). The NAEP 1992 rescore items did not require 
second scoring; therefore, there is no interreader reliability information to report on that 
component. 



5.6 DATA DELIVERY 

The 1994 NAEP data collection resulted in several classes of data files — student, school, 
teacher, excluded student, IEP/LEP student, sampling weight, student/teacher match and item
information. Item information included item data from all assessed students in 1994, item data 
for the short-term reading trend, and item data from the special study comparing image-based 
and paper-based scoring. Data resolution activities occurred prior to the submission of data
files to ETS and Westat to resolve any irregularities that existed. This section details additional
steps performed before creation of the final data files to ensure the most complete and accurate
information was captured. 

An important quality control component of the image scoring system was the inclusion, 
for purposes of file identification, of an exact copy of the entire student edit record, including 
the student booklet ID number, with every image of a student’s response to a constructed- 
response item. These edit files also remained in the main data files residing on the NCS 
mainframe computer. By doing this, exact matching of scores assigned to constructed-response 
items and the rest of each individual student’s data was guaranteed as the booklet ID for each 
image was part of every image file. 

When all the responses for an individual item had been scored, the system automatically 
submitted all item scores assigned during scoring and their edit records to a queue to be
transmitted to the mainframe. Project staff then initiated a system job to transmit all scoring
data to be matched with the original student records on the mainframe. A custom edit program
matched the edit records of the scoring files to those of the original edit records on the 
mainframe. As matches were confirmed, the scores were applied to those individual files. After 
completion of this stage, all data collected for an individual student was located in one single 
and complete record/file identified by the edit record. 
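
The matching step can be pictured as a keyed merge. The sketch below is a simplification under stated assumptions: it matches on the booklet ID alone, whereas the report describes matching on the full edit record, and the function name, field names, and identifiers are all hypothetical.

```python
def apply_scores(student_records, score_records):
    """Attach item scores to the student record with the same booklet ID.

    student_records maps a booklet ID to a dict of that student's data.
    score_records is an iterable of (booklet_id, item_id, score) tuples.
    Returns the score records that could not be matched.
    """
    unmatched = []
    for booklet_id, item_id, score in score_records:
        record = student_records.get(booklet_id)
        if record is None:
            # No student record with this booklet ID: hold for resolution.
            unmatched.append((booklet_id, item_id, score))
            continue
        record.setdefault("scores", {})[item_id] = score
    return unmatched


students = {"B001": {"grade": 4}, "B002": {"grade": 4}}
scores = [("B001", "R012", 3), ("B002", "R012", 1), ("B999", "R012", 2)]
unmatched = apply_scores(students, scores)
print(students["B001"])
print(unmatched)
```

Carrying the identifier with every image is what makes a merge of this kind exact; an unmatched score signals a processing problem rather than an ambiguity to be guessed at.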

Some of the assessed students were determined to be ineligible for the assessment 
because they did not match the particular age/grade being sampled or because of unusual 
circumstances. At the conclusion of each assessment, it was necessary to delete the records of 
these students from the NAEP database. Deleting this information required compiling a list of 
all student records that had been processed with administration codes other than those for 
assessed students. To do this, the process control system and the Administration Schedule data 
were referenced. If the system showed a discrepancy, project personnel pulled the 
Administration Schedules and other documentation (e.g., alerts, student booklets, etc.) to verify 
and resolve the discrepancy. 

The edits and data verification performed on the IEP/LEP student questionnaires
assured that information regarding the IEP/LEP status of the students was not left blank. If
there was no indication as to IEP or LEP on the questionnaire cover, the edit clerk
cross-checked the administration schedule(s) and student booklet cover to confirm the IEP/LEP
status of the student. If this information was not available from the questionnaire cover, booklet
cover, or the administration schedule, the edit clerk viewed the information indicated in
question #1 (which asked why the student was classified IEP or LEP) to see whether responses
written there might yield useful information. Then the determination was made as to how the 
student should be classified. 

The school questionnaires were revised for 1994 so that some items that had required 
school staff to provide a percentage figure by gridding ovals in a matrix were changed to allow 
the respondent to simply write the percentage in a box. These data were then captured via ICR
technology and verified by an edit operator. 

To obtain the best possible match of teacher questionnaires to student records, the same 
processes that were followed in 1992 were refined in 1994. The first step in matching was to
identify teacher questionnaires that had not been returned to NCS for processing, so as to
exclude the students of these teachers from the matching process. Student identification
numbers that were not matched to a teacher questionnaire were then cross-referenced with the
corresponding Administration Schedule and Roster of Teacher Questionnaires to verify the 
teacher number, teacher period, and questionnaire number recorded on these control 
documents. If a change could be made that would result in a match, the correction was applied 
to the student record. The NAEP school numbers listed on the Roster of Questionnaires, 
Administration Schedule, and teacher questionnaire were verified and corrected, if necessary. 

Once these resolutions were made, any duplicate teacher numbers that existed within a 
school were cross-referenced with the Rosters of Questionnaires for resolution, if possible. In
one jurisdiction that had multiple sessions in many schools, a number of the schools used a 
single Roster of Questionnaires for each session. This resulted in a larger than expected 
number of duplicate teacher numbers that could not be resolved. The overall quality of the
matching process improved in 1994 as a result of the inclusion of the teacher number and period 
on the Administration Schedule. Since this information was located together on a single, central 
control document, the ability to match and resolve discrepant or missing fields was simplified. 

After all data processing activities were completed, data cartridges or tapes were created
and shipped via overnight delivery to ETS and/or Westat, as appropriate. A duplicate archive
file is maintained at NCS for security/backup purposes.



5.7 MISCELLANEOUS 
5.7.1 Storage of Documents 

After the batches of image-scanned documents had successfully passed the editing 
process, they were sent to the warehouse for storage. Batches of 1992 rescore booklets were 
sent to the scoring area after passing the edit phase of processing, because they were also to be 
scored on paper. Once paper scoring was completed, 1992 rescore booklets were also sent to 
the warehouse for storage. The storage locations of all documents were recorded on the 
inventory control system. Unused materials were sent to temporary storage to await completion 
of the entire assessment. After the data tape was accepted, extra inventory was destroyed and a 
nominal supply of materials was stored permanently. 



5.7.2 Quality Control Documents

ETS requested that a random sample of booklets and the corresponding scores/scoring 
sheets be pulled for an additional quality control check. Because no scoring sheet was available 
for image-scanned documents, ETS used scores sent to them on a data tape to verify the 
accuracy of applied scores. For nonscannable trend booklets and for the 1992 rescore booklets 
that were scored on paper, both the booklet and its corresponding score sheet were sent to ETS.
An average of 20 of each booklet and scores/scoring sheets for each document type were 
selected at random by NCS. All of these documents were selected prior to sending the booklets 
to storage and were then sent to ETS to verify the accuracy and completeness of the data. 







5.7.3 Alert Analyst



Even though Receiving Department personnel were trained in the resolution of many 
problematic situations, some problems required resolution by NAEP staff. These are listed in 
Table 5-6. The types of problems were categorized and codes ("N" for national and "S" for Trial 
State) were assigned. For any unusual situations, Westat was called so that the Assessment 
Supervisors could be notified immediately to avoid further problems in test administration.

Many discrepancies were found in the receiving process that did not require an alert to 
be issued, but did require a great deal of effort to resolve in order to provide the most complete 
and accurate information. These included blank fields on covers of booklets as well as 
discrepancies between the booklet covers and the administration schedule. There were a total 
of 311 alerts for the Trial State Assessment. 



Table 5-6

Alerts for 1994 National and Trial State Assessments

Code        Description

N1/S1       Booklet covers not fully completed or bubbled
N2/S2       Information on covers does not match Administration Schedule
N3/S3       Handwritten or photocopied Administration Schedule
N4/S4*      Student Listing Form returned
N5/S5       Questionnaires discrepant with roster
N8/S8*      School shipments returned unused
N10/S10     Booklets missing or unaccounted for (i.e., make-up sessions)
N11/S11     Administration Code questionable
N17/S17     Roster/Administration Schedule not received
N25/S25     Transcribing document
N26         Excluded Student Questionnaire not assigned/ # not recorded on booklet
N27/S27     IEP/LEP not assigned/ # not recorded on booklet
N28/S28*    Booklets with an administration code of 14, 19, or 27
N29/S29*    Names returned on Administration Schedule/Roster
N30/S30     Other

* Alerts requiring only an information code.






Chapter 6 



CREATION OF THE DATABASE, QUALITY CONTROL OF DATA ENTRY, 
AND CREATION OF THE DATABASE PRODUCTS 



John J. Ferris, David S. Freund, and Alfred M. Rogers 
Educational Testing Service 



6.1 OVERVIEW 

The data transcription and editing procedures described in Chapter 5 resulted in the 
generation of disk and tape files containing various data for assessed students, excluded 
students, teachers, and schools. The weighting procedures described in Chapter 7 resulted in 
the generation of data files that included the sampling weights required to make valid statistical 
inferences about the population from which the 1994 fourth-grade Trial State Reading 
Assessment samples were drawn. These files were merged into a comprehensive, integrated 
database. The creation of the database is described in section 6.2. 

To evaluate the effectiveness of the quality control of the data entry process, the 
corresponding portion of the final integrated database was verified in detail against the sample 
of original instruments received from the field. The results of this procedure are given in 
section 6.3. 

The integrated database was the source for the creation of the NAEP item information 
database and the NAEP secondary-use data files. These are described in section 6.4. 



6.2 CREATION OF THE DATABASE 

6.2.1 Merging Files into the Trial State Assessment Database 

The transcription process conducted by National Computer Systems resulted in the 
transmittal to ETS of four data files for fourth grade: one file for each of the three 
questionnaires (teacher, school, and IEP/LEP student) and one file for the student response 
data. The sampling weights, derived by Westat, Inc., comprised an additional three files: one 
for students, one for schools, and one for excluded students. (See Chapter 7 for a discussion of 
the sampling weights.) These seven files were the foundation for the analysis of the 1994 Trial 
State Assessment data. Before data analyses could be performed, these data files had to be 
integrated into a coherent and comprehensive database. 

The 1994 Trial State Reading Assessment database for fourth grade consisted of three 
files: student, school, and excluded student. Each record on the student file contained a 
student’s responses to the particular assessment booklet the student was administered (booklets 
R1 to R16) and the information from the questionnaire that the student’s reading teacher 
completed. Additionally, for those assessed students who were identified as having an 
Individualized Education Plan (IEP) or Limited English Proficiency (LEP), data from the 
IEP/LEP Questionnaire is included. (Note that beginning with the 1994 assessment, the 
IEP/LEP questionnaire replaces the excluded student questionnaire. This questionnaire is filled 
out for all students identified as IEP and/or LEP, both assessed and excluded. See Chapter 2 
for information regarding assessment instruments.) Since teacher response data can be reported 
only at the student level, it was not necessary to have separate teacher files. The school files 
and student files (both assessed and excluded) were separate files and could be linked via the 
state, school, and school type codes. 

The creation of the student data files began with the reorganization of the data files 
received from National Computer Systems. This involved two major tasks: 1) the files were 
restructured, eliminating unused (blank) areas to reduce the size of the files; and 2) in cases 
where students had chosen not to respond to an item, the missing responses were recoded as 
either "omitted" or "not reached," as appropriate. Next, the student response data were merged 
with the student weights file. The resulting file was then merged with the teacher response data. 
In both merging steps, the booklet ID (the three-digit booklet number and the six-digit serial 
number) was used as the matching criterion. 
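
The two merging steps described above can be sketched as a keyed join. This is a minimal illustration only, not the program used at ETS: the record layout, the field names (booklet, serial, weight, t_q1), and the sample values are all hypothetical, and the sketch assumes that a student without a matching teacher record simply keeps no teacher fields.

```python
def booklet_id(record):
    """Build the matching key: three-digit booklet number plus six-digit serial number."""
    return f"{record['booklet']:03d}{record['serial']:06d}"


def merge_on_booklet_id(base_records, extra_records):
    """Attach the fields of extra_records to base_records that share a booklet ID."""
    extras = {booklet_id(rec): rec for rec in extra_records}
    merged = []
    for rec in base_records:
        combined = dict(rec)
        # A base record with no match keeps only its own fields.
        combined.update(extras.get(booklet_id(rec), {}))
        merged.append(combined)
    return merged


# Hypothetical sample data for illustration.
responses = [
    {"booklet": 1, "serial": 123, "item1": "A"},
    {"booklet": 16, "serial": 456, "item1": "C"},
]
weights = [
    {"booklet": 1, "serial": 123, "weight": 41.5},
    {"booklet": 16, "serial": 456, "weight": 38.2},
]
teacher = [
    {"booklet": 1, "serial": 123, "t_q1": 2},
]

# Step 1: responses + weights; step 2: result + teacher responses.
student_file = merge_on_booklet_id(merge_on_booklet_id(responses, weights), teacher)
```

Keeping the key construction in one function means both merging steps are guaranteed to use the same matching criterion.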

The school file was created by merging the school questionnaire file with the school 
weights file and a file of school-level variables, supplied by Westat and Quality Education 
Department, Inc. (QED), that included demographic information about the schools such as 
Race/Ethnicity percentages. The state, school, and school type codes were used as the matching 
criteria. Since some schools did not return a questionnaire and/or were missing QED data, 
some of the records in the school file contained only school-identifying information and 
sampling weight information. 

The excluded student file was created by merging the IEP/LEP student questionnaire 
file with the excluded student weights file. The assessment booklet serial number was used as 
the matching criterion. 

When the student, school, and excluded student files had been created, the database was 
ready for analysis. In addition, whenever new data values, such as composite background 
variables or plausible values, were derived, they were added to the appropriate database files 
using the same matching procedures as described above. 

For archiving purposes, restricted-use data files and codebooks for each jurisdiction were 
generated from this database. The restricted-use data files contain all responses and response- 
related data from the assessment, including responses from the student booklets and teacher and 
school questionnaires, proficiency scores, sampling weights, and variables used to compute 
standard errors. 



6.2.2 Creating the Master Catalog 

A critical part of any database is its processing control and descriptive information. 
Having a central repository of this information, which may be accessed by all analysis and 
reporting programs, will provide correct parameters for processing the data fields and consistent 
labeling for identifying the results of the analyses. The Trial State Assessment master catalog 
file was designed and constructed to serve these purposes for the Trial State Assessment 
database. 

Each record of the master catalog contains the processing, labeling, classification, and 
location information for a data field in the Trial State Assessment database. The control 
parameters are used by the access routines in the analysis programs to define the manner in 
which the data values are to be transformed and processed. 

Each data field has a 50-character label in the master catalog describing the contents of 
the field and, where applicable, the source of the field. The data fields with discrete or 
categorical values (e.g., multiple-choice and constructed-response items, but not weight fields) 
have additional label fields in the catalog containing 8- and 20-character labels for those values. 

The classification area of the master catalog record contains distinct fields corresponding 
to predefined classification categories (e.g., reading content area) for the data fields. For a 
particular classification field, a nonblank value indicates the code of the subcategory within the 
classification categories for the data field. This classification area permits the grouping of 
identically classified items or data fields by performing a selection process on one or more 
classification fields in the master catalog. 
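
The selection process on classification fields can be pictured with a small sketch. The catalog records, field names, and classification codes below are hypothetical and chosen only for illustration; the actual master catalog layout is not given in this report. The one convention taken from the text is that a blank classification value means the data field does not belong to that category.

```python
# Hypothetical master catalog records. A blank classification value means
# the data field is not classified under that category.
catalog = [
    {"field": "R012001", "label": "Item A", "content_area": "1", "item_type": "MC"},
    {"field": "R012002", "label": "Item B", "content_area": "2", "item_type": "CR"},
    {"field": "WEIGHT", "label": "Sampling weight", "content_area": "", "item_type": ""},
    {"field": "R012003", "label": "Item C", "content_area": "1", "item_type": "CR"},
]


def select_fields(records, **criteria):
    """Return the names of data fields whose classification codes match all criteria."""
    selected = []
    for rec in records:
        if all(code != "" and rec.get(name, "") == code
               for name, code in criteria.items()):
            selected.append(rec["field"])
    return selected


# Group identically classified items by one or more classification fields.
area_one = select_fields(catalog, content_area="1")
area_one_constructed = select_fields(catalog, content_area="1", item_type="CR")
```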

The master catalog file was constructed concurrently with the collection and transcription 
of the Trial State Assessment data so that it would be ready for use by analysis programs when 
the database was created. As new data fields were derived and added to the database, their 
corresponding descriptive and control information were entered into the master catalog. The 
machine-readable catalog files are available as part of the secondary-use data files package for 
use in analyzing the data with programming languages other than SAS and SPSS-X (see the 
NAEP 1994 Trial State Assessment in Reading Secondary-use Data Files User Guide). 



6.3 QUALITY CONTROL EVALUATION 

The purpose of the data entry quality control procedure is to gauge the overall accuracy 
of the process that transforms responses into machine-readable data. The procedure involves 
examining the actual responses made in a random sample of booklets and comparing them, 
mark by mark and character by character, with the responses recorded in the final database, 
which is used for analysis and reporting. 

In the present assessment, the selection of booklets for this comparison took place at the 
point of first entry into the recording process for data from the field. In past assessments, this 
selection took place only after data had reached the final database, in order to assure that only 
relevant booklets were involved in the quality control evaluation. While the new method of 
selection did result in some irrelevant booklets (due to absentee students or other 
problems), sufficient numbers of booklets were ultimately selected that did appear in the final 
database. The earlier availability of booklets for quality control evaluation and the improved 
efficiency of this new selection process were adequate compensation for the loss of control over 
which booklets were involved in quality control evaluation. 






6.3.1 Student Data 



Sixteen assessment booklets, R1 through R16, were administered as part of the Trial 
State Assessment in reading. Table 6-1 provides the numbers of each booklet for which data 
were scanned into data files. The variation in these numbers is trivial, indicating very good 
control of the distribution process. 

The number of students assessed in each of the 44 participating jurisdictions varied from 
a low of 2,081 to a high of 3,147. All but two jurisdictions met or exceeded the target 
participation rate for public schools. The average number of students assessed in each 
jurisdiction was 2,766. This was somewhat higher than the average in 1992. 

For the first time, the data entry process relied on image processing technology for 
recording the scores assigned by professional readers to the students’ constructed responses. 
The scanned image of a student’s response to one of these items was presented on the computer 
screen of a reader’s work station. After determining the score for the item, the reader then 
entered this score using the keyboard at the work station. 

This new process raised the question of what to verify or check in a quality control 
operation. The usual issue (whether the response that ended up in the final database is the 
same as the original intended response) could not be raised here, since the reader’s intention, 
which defines the data, was entered directly into the database without any intermediate steps. 
The question of whether readers consistently and accurately applied agreed-upon scoring rubrics 
was not at issue here; that question falls into the province of reader reliability studies. In short, 
the data for these items existed in only one form, the database itself, and could not be verified 
against any earlier or preliminary form. 

Rather than abdicate all quality control responsibility for these items, we chose to verify 
the process itself. Two important questions were examined: 

1. Was the identity of the respondent maintained? Did a respondent’s scores end up in his 
or her data record and not someone else’s? 

2. Was the identity of the item maintained? Did the score for each constructed-response 
item in this booklet end up correctly identified in the database, or was it transposed with 
another item response or perhaps left out? 

Four different booklets in this assessment contained some number of constructed- 
response items requiring professional scoring. To verify that the system was functioning 
correctly, four sets of artificial data were carefully constructed, one set for the constructed- 
response items in each of these booklets. Each set consisted of two booklets, representing a 
total of eight "respondents". These booklets were filled in with pre-assigned scores and 
processed in the usual way, the only difference being that the readers were presented with the 
score to assign, rather than with a passage to be evaluated. 

To assure correct identification of a booklet (question #1 above), the score pattern of 
each booklet was made unique, even under the assumption that the scorers made one recording 
error in every booklet. Such an error would not be relevant to the question of whether the 
correct respondent was being scored. As noted above, the question of whether the correct score 
is being assigned needs to be addressed through reader reliability studies. 



Table 6-1 

Number of Reading Booklets Scanned and Selected for Quality Control Evaluation 

Booklet     Total Booklets   Total Booklets 
Number      Scanned          Selected 
-------     --------------   -------------- 
R1          7,604            20 
R2          7,562            19 
R3          7,591            20 
R4          7,630            19 
R5          7,639            17 
R6          7,616            18 
R7          7,562            21 
R8          7,525            22 
R9          7,583            19 
R10         7,627            21 
R11         7,614            21 
R12         7,637            20 
R13         7,656            21 
R14         7,646            21 
R15         7,611            19 
R16         7,615            22 
Total       121,718          320 

To assure that item identity was maintained within a booklet (question #2 above), 
different responses were used across the constructed-response items for each booklet. Since the 
number of different responses was almost never adequate to allow making each response unique 
within a booklet, a second sample of each booklet was needed. Any item response which had to 
be duplicated within the first booklet of such a pair was designed to be different in the second 
booklet, and vice versa. 
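
The pairing rule for item identity can be checked mechanically. The sketch below is an illustration under assumptions, not the procedure used for the assessment: it treats the pre-assigned scores of the two booklets in a pair as parallel lists (one entry per constructed-response item, with made-up values) and asks whether every item position has a distinct (first booklet, second booklet) score combination, so that a transposed or dropped item would be detectable.

```python
def item_signatures(first_booklet, second_booklet):
    """Pair up the scores given to each item position across the two booklets."""
    if len(first_booklet) != len(second_booklet):
        raise ValueError("both booklets must cover the same items")
    return list(zip(first_booklet, second_booklet))


def items_distinguishable(first_booklet, second_booklet):
    """True if no two item positions share the same pair of pre-assigned scores."""
    signatures = item_signatures(first_booklet, second_booklet)
    return len(set(signatures)) == len(signatures)


# Hypothetical pre-assigned scores. Items 1 and 4 share a score in the first
# booklet, so they are given different scores in the second booklet.
good_first = [1, 2, 3, 1, 2]
good_second = [1, 2, 3, 2, 3]

# A flawed design: items 1 and 3 share a score in both booklets.
bad_first = [1, 2, 1]
bad_second = [3, 2, 3]

good_design_ok = items_distinguishable(good_first, good_second)
bad_design_ok = items_distinguishable(bad_first, bad_second)
```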

We are pleased to report reassurance for both of the above questions. Both item and 
respondent integrity were maintained in these booklets of artificial data. 

Student booklets were sampled in adequate numbers and the average rate of selection 
was about one out of 380, a selection rate comparable to that used in past assessments at both 
the state and national levels. The few errors found during this quality control examination did 
not cluster by booklet number, so there is no reason to believe that the variation in numbers of 
booklets selected had a significant effect on the estimates of overall error rate confidence limits 
reported below. 

The quality control evaluation detected 14 errors in these student booklet samples, about 
evenly divided between multiple responses that were not identified as such by the scanner and 
erasures that were recorded instead of ignored. As usual, there was some indication that the 
error rate could be improved with further tuning of the scanner procedures; the erroneously 
scanned responses would not have challenged human judgment — indeed, that was the criterion 
used to determine whether a mis-scanning had occurred. Not to lose sight of the final goal, 
however, the process as it stands can still be described as adequate for the support of 
conclusions about educational progress in America. A very large volume of data was scanned 
with consistently usable results. The usual quality control analysis based on the binomial 
theorem permits the inferences described in Table 6-2. 
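
A binomial analysis of this kind can be sketched as follows. This is an assumption-laden illustration rather than the computation actually performed: it uses an exact one-sided upper confidence limit found by bisection, and the confidence level of 0.998 is an assumed parameter that should be replaced by the level given in the table heading. The counts (14 errors in 19,792 characters) are the student figures from Table 6-2.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def upper_limit(errors, n, confidence):
    """One-sided exact upper limit: the rate p at which P(X <= errors) = 1 - confidence."""
    alpha = 1.0 - confidence
    lo, hi = errors / n, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        # The CDF decreases as p grows, so move toward the crossing point.
        if binom_cdf(errors, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


errors, characters = 14, 19792          # student figures from Table 6-2
observed_rate = errors / characters
limit = upper_limit(errors, characters, confidence=0.998)  # assumed level
```

The upper limit answers the question: how high could the true per-character error rate be while still making it plausible, at the chosen confidence level, to see this few errors in the sample?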



Table 6-2 

Inference from the Quality Control Evaluation of Grade 4 Data 

                        Different   Number of                Number              Upper 99.8% 
            Selection   Booklets    Booklets    Characters   of       Observed   Confidence 
Subsample   Rate        Sampled     Sampled     Sampled      Errors   Rate       Limit 
---------   ---------   ---------   ---------   ----------   ------   --------   ----------- 
Student     1/380       16          320         19,792       14       .0007      .0015 
Teacher     1/104       1           154         14,168       11       .0008      .0017 
School      1/77        1           61          6,588        3        .0005      .0019 
IEP/LEP     1/215       1           75          5,850        12       .0021      .0044 






6.3.2 Teacher Questionnaires 



A total of 16,011 questionnaires from reading teachers were associated with student data 
in the final database. These questionnaires were sampled at the rate of 1 in 104, roughly double 
the rate used in previous years. The 154 selected questionnaires contained a total of eleven 
errors in eleven different booklets, usually involving the scanner’s mistaking an erasure for a 
response, but occasionally involving the failure of the scanner to pick up a multiple response. In 
every case, the respondent’s intention was clear to the human eye, but the scanner seemed 
unprepared to exercise the same judgment that a careful observer would. The resulting error 
rate for the teacher questionnaire data was about the same as that for the student data. The 
quality of the teacher data is more than adequate for the purposes to which it was put. 



6.3.3 School Questionnaires 

A total of 4,704 questionnaires were collected from school administrators. These 
questionnaires were sampled for quality control evaluation at the rate of 1 in 77, resulting in the 
selection of 61 questionnaires. The three errors that were found represent an error rate about 
the same as that for the teacher questionnaire data, and about the same as that for school 
questionnaires in past years. 



6.3.4 IEP/LEP Student Questionnaires 

A total of 16,149 IEP/LEP questionnaires were scanned. About half of these 
questionnaires appear in the main student database, representing students who were included in 
the assessment. In the past, all students given this kind of questionnaire were excluded, and the 
instrument was referred to as the excluded student questionnaire. The overall selection rate was 
about 1 in 215, comparable to that used in earlier assessments for this questionnaire. Seventy- 
five questionnaires were selected in all. Both the selection rates and the resulting error rates 
were about the same in the two pools of students. Nearly all of the 12 errors found were due to 
the scanner’s mistaking an erasure for an intended response. The quality of these data appears 
to be about as high as the other questionnaires — that is to say, adequate for the purposes to 
which it was put. 

The results of the evaluation of all questionnaire data, as well as the student data, are 
summarized in Table 6-2. 



6.4 NAEP DATABASE PRODUCTS 

The NAEP database described to this point serves primarily to support analysis and 
reporting activities that are directly related to the NAEP contract. This database has a singular 
structure and access methodology that is integrated with the NAEP analysis and reporting 
programs. One of the directives of the NAEP contract is to provide secondary researchers with 
a nonproprietary version of the database that is portable to any computer system. In the event 
of transfer of NAEP to another client, the contract further requires ETS to provide a full copy 
of the internal database in a format that may be installed on a different computer system. 





In fulfillment of these requirements, ETS provides two sets of database products: the 
item information database and the secondary-use data files. The contents, format and usage of 
these products are documented in the publications listed under the appropriate sections below. 



6.4.1 The Item Information Database 

The NAEP item information database contains all of the descriptive, processing, and 
usage information for every assessment item developed and used for NAEP since 1970. The 
primary unit of this database is the item. Each NAEP item is associated with different levels of 
information, including usage across years and age cohorts, subject area classifications, response 
category descriptors, and locations of response data on secondary-use data files. 

The item information database is used for a variety of essential NAEP tasks: providing 
statistical information to aid in test construction, determining the usage of items across 
assessment years and ages for trend and cross-sectional analyses, labeling summary analyses and 
reports, and organizing items by subject area classifications for scaling analysis. 

The creation, structure, and use of the NAEP item information database for all items 
used up to and including the 1994 assessment are fully documented in the NAEP publications A 
Guide to the NAEP Item Information Database (Rogers, Barone, & Kline, 1995) and A Primer for 
the NAEP Item Information Database (Rogers, Kline, Barone, Mychajlowycz, & Forer, 1989). 

The procedures used to create the 1994 version of the item information database are the 
same as those documented in the guide. The updated version of the guide also contains the 
subject area classification categories for the cognitive items. 



6.4.2 The Secondary-use Data Files 

The secondary-use data files are designed to enable any researcher with an interest in 
the NAEP database to perform secondary analysis on the same data as those used at ETS. The 
three elements of the distribution package are the data files, the printed documentation, and 
copies of the questionnaires and released item blocks. A set of files for each sample or 
instrument contains the response data file, a file of control statements that will generate an 
SPSS system file, a file of control statements that will generate a SAS system file, and a 
machine-readable catalog file. Each machine-readable catalog file contains sufficient control and 
descriptive information to permit the user who does not have either SAS or SPSS to set up and 
perform data analysis. The printed documentation consists of two volumes: a guide to the use 
of the data files, and a set of data file layouts and codebooks for each of the participants in the 
assessment. 

The remainder of this section summarizes the procedures used in generating the data 
files and related materials. 






6.4.2.1 File Definition 



There are essentially five samples for analysis in the 1994 Trial State Reading 
Assessment: the assessed students, the excluded students, and the schools in the state-by-state 
component, and the assessed students and the schools in a matched comparison sample drawn 
from the national reading assessment. Each state sample is divided into separate files by each 
jurisdiction, resulting in a total of over 130 files, but the same file formats, linking conventions, 
and analysis considerations apply to each file within a given sample. For example, the analysis 
specification that links school and assessed student data for California would apply identically to 
New York, Illinois, or any other participant or group of participants. 

Each participant data file still requires its own data codebook, detailing the frequencies 
of data values within that jurisdiction. The file layouts, SPSS and SAS syntax and 
machine-readable catalog files, however, need only be generated for each sample, since the 
individual jurisdiction data files within a state sample are identical in format and data code 
definition. 



6.4.2.2 Definition of the Variables 

The lifting of the restraint on confidential data simplified the variable definition process 
as it permitted the transfer of all variables from the database to the secondary-use files. 

The initial step in this process was the generation of a LABELS file of descriptors of the 
variables for each data sample to be created. Each record in a LABELS file contains, for a 
single data field, the variable name, a short description of the variable, and processing control 
information to be used by later steps in the data generation process. This file could be edited 
for deletion of variables, modification of control parameters, or reordering of the variables 
within the file. The LABELS file is an intermediate file only; it is not included on the released 
data files. 

The next program in the processing stream, GENLYT, produced a printed layout for 
each file from the information in its corresponding LABELS file. These layouts were initially 
reviewed for the ordering of the variables. 

The variables on all data files were grouped and arranged in the following order: 
identification information, weights, derived variables, proficiency scale scores (where applicable), 
and response data. On the student data files, these fields were followed by the teacher response 
data and the IEP/LEP student questionnaire data, where applicable. The identification 
information is taken from the front covers of the instruments. The weight data include sample 
descriptors, selection probabilities, and replicate weights for the estimation of sampling error. 
The derived data include sample descriptions from other sources and variables that are derived 
from the response data for use in analysis or reporting. 

For each assessed student sample in the state component and national comparison 
sample, the item response data within each block were left in their order of presentation. The 
blocks, however, were arranged according to the following scheme: common background, 
subject-related background, the cognitive blocks in ascending numerical order, and student 
motivation. The responses to cognitive blocks that were not present in a given booklet were left 
blank, signifying a condition of "missing by design." 

In order to process and analyze the spiral sample data effectively, the user must also be 
able to determine, from a given booklet record, which blocks of item response data were present 
and their relative order in the instrument. This problem was remedied by the creation of a set 
of control variables, one for each block, which indicated not only the presence or absence of the 
block but its order in the instrument. These control variables were included with the derived 
variables. 
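
One way to picture these control variables is sketched below. The block identifiers, the variable naming scheme, and the encoding (0 for an absent block, otherwise the block's 1-based position in the booklet) are assumptions made for illustration; the report states only that each control variable indicates both presence and order.

```python
# Hypothetical cognitive block identifiers for the assessment.
ALL_BLOCKS = ["R3", "R4", "R5", "R6", "R7"]


def block_control_variables(booklet_blocks, all_blocks=ALL_BLOCKS):
    """Return one control value per block: 0 if absent, else its order in the booklet."""
    positions = {block: order for order, block in enumerate(booklet_blocks, start=1)}
    return {f"{block}_ORDER": positions.get(block, 0) for block in all_blocks}


# A hypothetical booklet that presents block R6 first and block R4 second.
controls = block_control_variables(["R6", "R4"])

# Blocks actually present, listed in their order of presentation.
present_in_order = sorted(
    (name for name, order in controls.items() if order > 0),
    key=lambda name: controls[name],
)
```

With such variables on each record, an analysis program can tell "missing by design" blanks apart from blocks the student actually saw without consulting the booklet map.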



6.4.2.3 Data Definition 

To enable the data files to be processed on any computer system using any procedural or 
programming language, it was desirable that the data be expressed in numeric format. This was 
possible, but not without the adoption of certain conventions for reexpressing the data values. 

As mentioned in section 6.1, the responses to all multiple-choice items were transcribed 
and stored in the database using the letter codes printed in the instruments. This scheme 
afforded the advantage of saving storage space for items with 10 or more response options, but 
at the expense of translating these codes into their numeric equivalents for analysis purposes. 
The response data fields for most of these items would require a simple alphabetic-to-numeric 
conversion. However, the data fields for items with 10 or more response choices would require 
"expansion" before the conversion, since the numeric value would require two column positions. 
One of the processing control parameters on the LABELS file indicates whether or not the data 
field is to be expanded before conversion and output. 
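
The alphabetic-to-numeric conversion and the expansion step can be sketched as below. This is a minimal illustration under assumptions: the function name is hypothetical, the mapping A=1, B=2, and so on is the natural reading of "numeric equivalents" but is not spelled out in the report, and zero-padding of expanded fields is an assumed convention.

```python
def letter_to_numeric(code, expand=False):
    """Convert a one-letter response code to digits (A -> 1, B -> 2, ...).

    With expand=True the output occupies two columns, which is required
    for items with 10 or more response options.
    """
    letter = code.strip().upper()
    if len(letter) != 1 or not "A" <= letter <= "Z":
        raise ValueError(f"not a letter code: {code!r}")
    value = ord(letter) - ord("A") + 1
    if value >= 10 and not expand:
        raise ValueError("field must be expanded to hold a two-digit value")
    width = 2 if expand else 1
    return str(value).zfill(width)


simple = letter_to_numeric("C")                    # one-column field
expanded_low = letter_to_numeric("C", expand=True)  # two-column field
expanded_high = letter_to_numeric("J", expand=True)
```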

The ETS database contained special codes to indicate certain response conditions: "I 
don’t know" responses, multiple responses, omitted responses, not-reached responses, and 
unresolvable responses, which included out-of-range responses and responses that were missing 
due to errors in printing or processing. The scoring guides for the reading constructed-response 
items included additional special codes for ratings of "illegible," "off task," and non-rateable by 
the scorers. All of these codes had to be reexpressed in a consistent numeric format. 

The following convention was adopted and used in the designation of these codes: The 
"I don’t know" and non-rateable response codes were always converted to 7; the "omitted" 
response codes were converted to 8; the "not-reached" response codes were converted to 9; the 
multiple response codes were converted to 0; the "illegible" codes were converted to 5; and the 
"off task" codes were converted to 6. The out-of-range and missing responses were coded as 
blank fields, corresponding to the "missing by design" designation. 

This coding scheme created conflicts for those multiple-choice items that had seven or 
more valid response options as well as the "I don’t know" response and for those constructed- 
response items whose scoring guide had five or more categories. These data fields were also 
expanded to accommodate the valid response values and the special codes. In these cases, the 
special codes were "extended" to fill the output data field: The "I don’t know" and non-rateable 
codes were extended from 7 to 77, omitted response codes from 8 to 88, etc. 
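
The special-code convention, including the extension to two columns, can be summarized in a short sketch. The condition names are hypothetical labels introduced here. Extending a code by repeating its digit follows the two examples in the text (7 to 77, 8 to 88); applying the same rule to the remaining codes, including the multiple-response code 0, is an assumption covered only by the "etc." in the source.

```python
# Numeric codes for special response conditions, as listed in the text.
SPECIAL_CODES = {
    "dont_know": 7,
    "non_rateable": 7,
    "omitted": 8,
    "not_reached": 9,
    "multiple": 0,
    "illegible": 5,
    "off_task": 6,
}

# Conditions written as blank fields, matching "missing by design".
BLANK_CONDITIONS = {"out_of_range", "missing"}


def encode_special(condition, width=1):
    """Return the output field for a special response condition.

    width=1 gives the basic code; width=2 gives the extended form used
    when valid response values would collide with the one-digit codes.
    """
    if condition in BLANK_CONDITIONS:
        return " " * width
    return str(SPECIAL_CODES[condition]) * width


basic_omitted = encode_special("omitted")
extended_omitted = encode_special("omitted", width=2)
extended_dont_know = encode_special("dont_know", width=2)
blank_missing = encode_special("missing", width=2)
```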






Each numeric variable on the secondary-use files was classified as either continuous or 
discrete. The continuous variables include the weights, proficiency values, identification codes, 
and item responses where counts or percentages were requested. The discrete variables include 
those items for which each numeric value corresponds to a response category. The designation 
of "discrete" also includes those derived variables to which numeric classification categories have 
been assigned. The constructed-response items were treated as a special subset of the discrete 
variables and were assigned to a separate category to facilitate their identification in the 
documentation. 



6.4.2.4 Data File Layouts 

The data file layouts, as mentioned above, were the first user product to be generated in 
the secondary-use data files process. The generation program, GENLYT, used a LABELS file 
and a CATALOG file as input and produced a printable file. The LAYOUT file is little more 
than a formatted listing of the LABELS file. 

Each line of the LAYOUT file contains the following information for a single data field: 
sequence number, field name, output column position, field width, number of decimal places, 
data type, value range, key or correct response value, and a short description of the field. The 
sequence number of each field is implied from its order on the LABELS file. The field name is 
an 8-character label for the field that is to be used consistently by all secondary-use data files 
materials to refer to that field on that file. The output column position is the relative location 
of the beginning of that field on each record for that file, using bytes or characters as the unit of 
measure. The field width indicates the number of columns used in representing the data values 
for a field. If the field contains continuous numeric data, the value under the number of 
decimal places entry indicates how many places to shift the decimal point before processing data 
values. 
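
For a user reading the data files without SAS or SPSS, the layout entries translate directly into a field reader. The sketch below is illustrative only: the sample record and its column positions are made up, the column position is taken to be 1-based, and blank fields are returned as None on the assumption that they represent missing data.

```python
from decimal import Decimal


def read_field(record, column, width, decimals=0):
    """Extract one field from a fixed-width record.

    column   -- 1-based starting column from the layout
    width    -- number of columns the field occupies
    decimals -- number of places to shift the implied decimal point
    """
    raw = record[column - 1 : column - 1 + width]
    if raw.strip() == "":
        return None  # blank field, treated here as missing
    return Decimal(int(raw)).scaleb(-decimals)


# Hypothetical record: an ID in columns 1-5, a weight with two implied
# decimal places in columns 6-10, a blank field in column 11, and a
# discrete response in column 12.
record = "0010012345 3"

weight = read_field(record, column=6, width=5, decimals=2)
blank = read_field(record, column=11, width=1)
response = read_field(record, column=12, width=1)
```

Using Decimal keeps the shifted value exact, which matters when weights are summed over many records.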



The data type category uses five codes to designate the nature of the data in the field: 
Continuous numeric data are coded "C"; discrete numeric data are coded "D"; constructed- 
response item data are coded "OS" if the item was dichotomized for scaling and "OE" if it was 
scaled under a polytomous response model. Additionally, the discrete numeric fields that 
include "I don’t know" response codes are coded "DI." If the field type is discrete numeric, the 
value range is listed as the minimum and maximum permitted values separated by a hyphen to 
indicate range. If the field is a response to a scorable item, the correct option value, or key, is 
printed; if the field is an assigned score that was scaled as a dichotomous item using cut point 
scoring, the range of correct scores is printed. Each variable is further identified by a 
50-character descriptor. 



6.4.2.5 Data File Catalogs 

The LABELS file contains sufficient descriptive information for generating a brief layout 
of the data file. However, to generate a complete codebook document, substantially more 
information about the data is required. The CATALOG file provides most of this information. 








The CATALOG file is created by the GENCAT program from the LABELS file and the 
1994 master catalog file. Each record on the LABELS file generates a CATALOG record by 
first retrieving the master catalog record corresponding to the field name. The master catalog 
record contains usage, classification, and response code information, along with positional 
information from the LABELS file: field sequence number, output column position, and field 
width. Like the LABELS file, the CATALOG file is an intermediate file and is not included on 
the released data files. 

The information for the response codes, also referred to as "foils," consists of the valid 
data values for the discrete numeric fields, and a 20-character description of each. The 
GENCAT program uses additional control information from the LABELS file to determine if 
extra foils should be generated and saved with each CATALOG record. The first flag controls 
generation of the "I don’t know" or non-rateable foil; the second flag regulates omitted or not- 
reached foil generation; and the third flag denotes the possibility of multiple responses for that 
field and sets up an appropriate foil. All of these control parameters, including the expansion 
flag, may be altered in the LABELS file by use of a text editor, in order to control the 
generation of data or descriptive information for any given field. 
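As a rough illustration of how three control flags could drive the generation of extra foils, consider the sketch below. The numeric codes (7, 8, 9, 0), the label wording, and the function name `build_foils` are assumptions made for the example; the report does not state the codes GENCAT actually used.

```python
def build_foils(valid_foils, idk=False, omitted=False, multiple=False):
    """Return {code: label}, with extra foils appended per the three control flags.

    All extra codes below are hypothetical placeholders.
    """
    foils = dict(valid_foils)
    if idk:        # first flag: "I don't know" or non-rateable foil
        foils[7] = "I don't know"
    if omitted:    # second flag: omitted and not-reached foils
        foils[8] = "Omitted"
        foils[9] = "Not reached"
    if multiple:   # third flag: multiple-response foil
        foils[0] = "Multiple response"
    # Foil descriptions are limited to 20 characters.
    return {code: label[:20] for code, label in foils.items()}

foils = build_foils({1: "Yes", 2: "No"}, idk=True, multiple=True)
print(foils)
```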

The LABELS file supplies control information for many of the subsequent secondary-use 
data processing steps. The CATALOG file provides detailed information for those and other 
steps. 



6.4.2.6 Data Codebooks 

The data codebook is a printed document containing complete descriptive information 
for each data field. Most of this information originates from the CATALOG file; the remaining 
data come from the COUNTS file and the IRT parameters file. 

Each data field receives at least one line of descriptive information in the codebook. If 
the data type is continuous numeric, no more information is given. If the variable is discrete 
numeric, the codebook lists the foil codes, foil labels, and frequencies of each value in the data 
file. Additionally, if the field represents an item used in IRT scaling, the codebook lists the 
parameters used by the scaling program. 

Certain blocks of cognitive items in the 1994 assessment that are to be used again in 
later assessments for trend comparisons have been designated as nonreleased. In order to 
maintain their confidentiality, generic labels have been substituted for the response category 
descriptions of these items in the data codebooks and the secondary-use files. 

The frequency counts are not available on the catalog file, but must be generated from 
the data. The GENFREQ program creates the COUNTS file using the field name to locate the 
variable in the database, and the foil values to validate the range of data values for each field. 
This program also serves as a check on the completeness of the foils in the CATALOG file, as it 
flags any data values not represented by a foil value and label. 
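The counting and checking roles described for GENFREQ can be sketched in a few lines. This is not the GENFREQ program itself; the function name, the foil table, and the data values are hypothetical.

```python
from collections import Counter

def count_frequencies(values, foils):
    """Tally each data value and list any value that has no foil defined."""
    counts = Counter(values)
    unmatched = sorted(v for v in counts if v not in foils)
    return counts, unmatched

# Hypothetical foils for one discrete field, and the values found in the data.
foils = {1: "Correct", 2: "Incorrect", 8: "Omitted"}
counts, unmatched = count_frequencies([1, 2, 2, 8, 5, 1, 2], foils)
print(dict(counts), unmatched)
```

A non-empty `unmatched` list corresponds to the completeness check described above: the value 5 appears in the data but has no foil value and label.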

The IRT parameter file is linked to the CATALOG file through the field name. Printing 
of the IRT parameters is governed by a control flag in the classification section of the 
CATALOG record. If an item has been scaled for use in deriving the proficiency estimates, the 
IRT parameters are listed to the right of the foil values and labels, and the score value for each 
response code is printed to the immediate right of the corresponding frequency. 

The LAYOUT and CODEBOOK files are written by their respective generation 
programs to print-image disk data files. Draft copies are printed and distributed for review 
before the production copy is generated. The production copy is printed on an IBM 3800 
printer that uses laser-imaging technology to produce high-quality, reproducible documentation. 



6.4.2.7 Control Statement Files for Statistical Packages 

An additional requirement of the NAEP contract is to provide, for each secondary-use 
data file, a file of control statements each for the SAS and SPSS statistical systems that will 
convert the raw data file into the system data file for that package. Two separate programs, 
GENSAS and GENSPX, generate these control files using the CATALOG file as input. 

Each of the control files contains separate sections for variable definition, variable 
labeling, missing value declaration, value labeling, and creation of scored variables from the 
cognitive items. The variable definition section describes the locations of the fields, by name, in 
the file, and, if applicable, the number of decimal places or type of data. The variable label 
identifies each field with a 50-character description. The missing value section identifies values 
of those variables that are to be treated as missing and excluded from analyses. The value labels 
correspond to the foils in the CATALOG file. The code values and their descriptors are listed 
for each discrete numeric variable. The scoring section is provided to permit the user to 
generate item score variables in addition to the item response variables. 

Each of the code generation programs combines three steps into one complex procedure. 
As each CATALOG file record is read, it is broken into several component records according to 
the information to be used in each of the resultant sections. These record fragments are tagged 
with the field sequence number and a section sequence code. They are then sorted by section 
code and sequence number. Finally, the reorganized information is output in a structured 
format dictated by the syntax of the processing language. 
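The fragment, sort, and output procedure can be sketched as follows. The output syntax here is a generic placeholder rather than valid SAS or SPSS, and the record keys (`name`, `start`, `width`, `label`, `missing`) and section names are assumptions made for illustration.

```python
SECTION_NAMES = ["VARIABLE DEFINITION", "VARIABLE LABELS", "MISSING VALUES"]

def generate_control_lines(catalog_records):
    # Step 1: break each record into fragments tagged with
    # (section sequence code, field sequence number).
    fragments = []
    for seq, rec in enumerate(catalog_records, start=1):
        name = rec["name"]
        last_col = rec["start"] + rec["width"] - 1
        fragments.append((0, seq, f"{name} {rec['start']}-{last_col}"))
        fragments.append((1, seq, f"{name} '{rec['label']}'"))
        if rec.get("missing"):
            codes = ", ".join(str(c) for c in rec["missing"])
            fragments.append((2, seq, f"{name} ({codes})"))
    # Step 2: sort by section code, then by field sequence number.
    fragments.sort(key=lambda frag: (frag[0], frag[1]))
    # Step 3: write the fragments out section by section.
    lines = []
    current_section = None
    for section, _seq, text in fragments:
        if section != current_section:
            lines.append(f"* {SECTION_NAMES[section]}")
            current_section = section
        lines.append(f"  {text}")
    return lines

catalog = [
    {"name": "ITEM01", "start": 10, "width": 1, "label": "Hypothetical item", "missing": [9]},
    {"name": "WEIGHT", "start": 11, "width": 7, "label": "Hypothetical weight"},
]
lines = generate_control_lines(catalog)
print("\n".join(lines))
```

Tagging every fragment before sorting is what lets a single pass over the CATALOG file feed several output sections, each of which must list the fields in file order.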

The generation of the system files accomplishes the testing of these control statement 
files. The system files are saved for use in special analyses by NAEP staff. These control 
statement files are included on the distributed data files to permit users with access to SAS 
and/or SPSS to create their own system files. 



6.4.2.8 Machine-readable Catalog Files 

For those NAEP data users who have neither SAS nor SPSS capabilities, yet require 
processing control information in a computer-readable format, the distribution files also contain 
machine-readable catalog files. Each machine-readable catalog record contains processing 
control information, IRT parameters, and foil codes and labels. 






6.4.2.9 NAEP Data on Disk 



The complete set of secondary-use data files described above is available on CD-ROM 
as part of the NAEP Data on Disk product suite. This medium can be ideal for researchers and 
policy makers operating in a personal computing environment. 

The NAEP Data on Disk product suite includes two other components that facilitate the 
analysis of NAEP secondary-use data. The PC-based NAEP data extraction software, NAEPEX, 
enables users to create customized extracts of NAEP data and to generate SAS or SPSS control 
statements for preparing analyses or generating customized system files. The NAEP analysis 
modules, which currently run under SPSS® for Windows™, use output files from the extraction 
software to perform analyses that incorporate statistical procedures appropriate for the NAEP 
design. 






Chapter 7 

WEIGHTING PROCEDURES AND VARIANCE ESTIMATION 



Mansour Fahimi, Keith F. Rust, and John Burke 
Westat, Inc. 



7.1 OVERVIEW 

Following the collection of assessment and background data from and about assessed and 
excluded students, the processes of deriving sampling weights and associated sets of replicate 
weights were carried out. The sampling weights are needed to make valid inferences from the 
student samples to the respective populations from which they were drawn. Replicate weights 
are used in the estimation of sampling variance, through the procedure known as jackknife 
repeated replication. 

Each student was assigned a weight to be used for making inferences about the state’s 
students. This weight is known as the full-sample or overall sample weight. The full-sample 
weight contains three components. First, a base weight is established which is the inverse of the 
overall probability of selection of the sampled student. The base weight incorporates the 
probability of selecting a school and the student within a school. This weight is then adjusted 
for two sources of nonparticipation — school level and student level. These weighting 
adjustments seek to reduce the potential for bias from such nonparticipation by increasing the 
weights of students from schools similar to those schools not participating, and increasing the 
weights of students similar to those students from within participating schools who did not 
attend the assessment session as scheduled. The details of how these weighting steps were 
implemented are given in sections 7.2 and 7.3. 

Section 7.4 addresses the effectiveness of the adjustments made to the weights using the 
procedures described in section 7.3. The section examines characteristics of nonresponding 
schools and students, and investigates the extent to which nonrespondents differ from respondents in 
ways not accounted for in the weight adjustment procedures. Section 7.5 considers the 
distributions of the final student weights in each state, and whether there were outliers that 
called for further adjustment. 

In addition to the full-sample weights, a set of replicate weights was provided for each 
student. These replicate weights are used in calculating the sampling errors of estimates 
obtained from the data, using the jackknife repeated replication method. Full details of the 
method of using these replicate weights to estimate sampling errors are contained in the NAEP 
technical reports from the 1992 assessment (Johnson & Carlson, 1994) and earlier. Section 7.6 
of this report describes how the sets of replicate weights were generated for the 1994 Trial State 
Assessment data. The methods of deriving these weights were aimed at reflecting the features 
of the sample design appropriately in each state, so that when the jackknife variance estimation 
procedure is implemented, approximately unbiased estimates of sampling variance result. 
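This chapter defers the details of the jackknife procedure to the earlier technical reports. As a minimal sketch, the example below assumes the commonly used form in which the variance estimate is the sum, over replicates, of the squared difference between the replicate estimate and the full-sample estimate. The scores, weights, and function names are hypothetical.

```python
def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def jackknife_variance(values, full_weights, replicate_weights):
    """Sum of squared deviations of replicate estimates from the full-sample estimate.

    This form is an assumption for illustration; see the cited technical
    reports for the procedure actually applied to NAEP data.
    """
    full = weighted_mean(values, full_weights)
    return sum((weighted_mean(values, rw) - full) ** 2 for rw in replicate_weights)

# Hypothetical data: three students and two sets of replicate weights.
scores = [200.0, 220.0, 240.0]
full_w = [1.0, 1.0, 1.0]
rep_w = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0]]
variance = jackknife_variance(scores, full_w, rep_w)
print(variance)
```

Each replicate re-estimates the statistic with one set of perturbed weights, so the only inputs an analyst needs are the full-sample weight and the replicate weights supplied on the data file.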



7.2 CALCULATION OF BASE WEIGHTS 

The base weight assigned to a school was the reciprocal of the probability of selection of 
that school. The school base weight reflected the actual probability used to select the school 
from the frame, including the impact of avoiding schools selected for the national sample. For 
"new" schools selected using the supplemental new school sampling procedures (see section 
3.5.3), the school base weight reflected the combined probability of selection of the district, and 
school within district. 

The student base weight was obtained by multiplying the school base weight by the 
within-school student weight, where the within-school student weight reflected the probability of 
selecting students within the school. Additional details about the weighting process are given in 
the sections below. 



7.2.1 Calculation of School Base Weights 

The base weight for sample school i was computed as: 

$$W_i^{sch} = \frac{E}{m E_i}$$

where 

$E_i$ = the enrollment in the given school; 

$E = \sum_{i=1}^{S} E_i$ = the state-wide enrollment obtained by summing across all S 
schools in the state frame; and 

$m$ = the number of schools selected from the state. 



In each state, all schools included in the sample with certainty were assigned school base 
weights of unity. 
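The calculation can be illustrated numerically. The sketch below reads the base weight as the state-wide enrollment divided by the product of the number of schools selected and the school's own enrollment, and caps the selection probability at 1 so that certainty schools receive a weight of unity. Treating certainty schools with a simple cap is a simplifying assumption, and the enrollments are hypothetical.

```python
def school_base_weights(frame_enrollments, n_selected):
    """Base weight each frame school would carry if it were selected.

    Selection probability is taken as n_selected * E_i / E, capped at 1
    (a simplified treatment of certainty schools).
    """
    total = sum(frame_enrollments)
    weights = []
    for enrollment in frame_enrollments:
        prob = min(1.0, n_selected * enrollment / total)
        weights.append(1.0 / prob)
    return weights

# Hypothetical frame of three schools, two of which are to be selected.
weights = school_base_weights([50, 100, 350], n_selected=2)
print(weights)
```

Here the largest school has a nominal selection probability above 1, so it is a certainty selection with weight 1, while the smallest school carries a weight of 5.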



Schools sampled with certainty were sometimes selected more than once in the 
systematic sampling process. If a large school was selected more than once, each selection was 
treated separately in the selection of students within a school. For example, a school that was 
selected twice was allocated twice the usual numbers of students for the assessments; a school 
that was selected three times was allocated three times the usual numbers of students for the 
assessments. 



7.2.2 Weighting New Schools 

New public schools were identified and sampled through a two-stage sampling process, 
involving the selection of districts, and then of new schools within selected districts. This 
process is described in Chapter 3. There were two distinct processes used depending upon the 
size of the district. 

Within each state, public school districts were partitioned into those having (at most) 
one school with grade 4, one with grade 8, and one with grade 12, versus all other districts. For 
the first set of small districts, the selection of the grade 4 school from the frame, in the initial 
sample of schools for the state, triggered an inquiry of the district as to whether there were in 
fact any additional schools with grade 4 (not contained on the school sampling frame). Any 
school so identified was automatically added to the sample for the assessment. Thus the 
selection probability of such a school was equal to that of the grade 4 school from the district 
that was included on the school frame, and the school base weight was calculated accordingly. 

For the larger districts (those having multiple schools in at least one of grades 4, 8, and 12), 
a sample of districts was selected in each state. Districts in the sample were asked to identify 
schools having grade 4 that were not included on the school frame. A sample of these newly 
identified schools was then selected. The base weight for these schools reflected both the 
probability that the district was selected for this updating process, and that the school was 
included in the NAEP sample, having been identified as new by the district. 



7.2.3 Treatment of New and Substitute Schools 

Schools that replaced a refusing school (i.e., substitute schools) were assigned the weight 
of the refusing school, unless the substitute school also refused. Thus the substitute school was 
treated as if it were the original school that it replaced, for purposes of obtaining school base 
weights. 



7.2.4 Calculation of Student Base Weights 

Within the sampled schools, eligible students were sampled. The within-school 
probability of selection therefore depended on the number of eligible students in the school and 
the number of students selected for the assessment (usually 30). The within-school weights for 
the substitute schools were further adjusted to compensate for differences in the sizes of the 
substitute and the originally sampled (replaced) schools. Thus, in general, the within-school 
student weight for the jth student in school i was equal to: 

$$W_{ij}^{within} = \frac{N_i}{n_i} \times K_i,$$

where 

$N_i$ = the number of eligible students enrolled in the school, as reported in the 
sampling worksheets; 

$n_i$ = the number of students selected; and 

$$K_i = \frac{E_i}{E_i'},$$

with 

$E_i$ = the QED grade enrollment of the originally sampled (replaced) school; 
and 

$E_i'$ = the QED grade enrollment of the substitute school. 

The factor $K_i$ in the above formula for the within-school student weight applies to only a 
few schools in each state. This factor adjusts the count of eligible students in a substitute school 
to be consistent with the corresponding count of the originally sampled (replaced) school. For 
nonsubstitute schools, $K_i$ was set to 1. 
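A short sketch of the within-school weight follows. It takes the weight to be the ratio of eligible to selected students, multiplied for substitute schools by the ratio of the original school's grade enrollment to the substitute's. The function name, argument names, and numbers are hypothetical.

```python
def within_school_weight(n_eligible, n_selected, orig_enrollment=None, sub_enrollment=None):
    """Within-school student weight, with the substitute-school size factor.

    The size factor is 1 unless both enrollments are given, that is, unless
    the school is a substitute for an originally sampled school.
    """
    size_factor = 1.0
    if orig_enrollment is not None and sub_enrollment is not None:
        size_factor = orig_enrollment / sub_enrollment
    return (n_eligible / n_selected) * size_factor

# Nonsubstitute school: 90 eligible students, 30 selected.
regular = within_school_weight(90, 30)
# Substitute school of grade enrollment 80 replacing a school of grade enrollment 120.
substitute = within_school_weight(90, 30, orig_enrollment=120, sub_enrollment=80)
print(regular, substitute)
```

The student base weight is then the product of the school base weight and this within-school weight, as stated in section 7.2.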



7.3 ADJUSTMENTS FOR NONRESPONSE 

As mentioned earlier, the base weight for a student was adjusted by two factors: one to 
adjust for nonparticipating schools for which no substitute participated, and one to adjust for 
students who were invited to the assessment but did not appear in the scheduled sessions. 



7.3.1 Defining Initial School-Level Nonresponse Adjustment Classes 

School-level nonresponse adjustment classes were created separately for public and 
nonpublic schools within each state. For each set these classes were defined as a function of 
their sampling strata, as detailed next. 

Public Schools. For each state, the initial school nonresponse adjustment classes were 
formed by crossclassifying the level of urbanization and minority status (see Chapter 3 for 
definitions of these characteristics). Where there were no minority strata within a particular 
level of urbanization, a categorized version of median income was used. For this purpose within 
each level of urbanization, public schools were sorted by the median income, and then divided 
into three groups of about equal size, representing low, middle, and high income areas. 
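The income grouping can be sketched as a sort followed by a split into thirds. The school identifiers, incomes, and the rule for handling group sizes that do not divide evenly are assumptions made for the example.

```python
def income_groups(schools):
    """Assign each school to a low, middle, or high income group.

    schools -- list of (school_id, median_income) pairs for one level of
               urbanization; groups are of about equal size.
    """
    ordered = sorted(schools, key=lambda s: s[1])
    n = len(ordered)
    labels = ("low", "middle", "high")
    return {sid: labels[3 * rank // n] for rank, (sid, _income) in enumerate(ordered)}

# Hypothetical schools within a single urbanization level.
schools = [("A", 31000), ("B", 22000), ("C", 45000),
           ("D", 27000), ("E", 38000), ("F", 25000)]
groups = income_groups(schools)
print(groups)
```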




Nonpublic Schools. For each state (excluding District of Columbia and Guam nonpublic 
schools), nonresponse adjustment classes were formed by crossclassifying school type (Catholic 
and nonCatholic) and metropolitan area status (metro/nonmetro). For District of Columbia 
nonpublic schools these classes were defined by crossclassifying school type and two levels of 
estimated grade enrollment (25 or fewer students, versus 26 or more students). For Guam, 
initial nonresponse classes for nonpublic schools were defined by school type only. The District 
of Columbia is entirely metropolitan, and Guam is entirely nonmetropolitan, so alternatives 
were needed for these two jurisdictions. 

Department of Defense Educational Activity (DoDEA) Overseas Schools. For the 
jurisdiction comprised of DoDEA Overseas schools, there was only one nonresponding school. 
This school, along with the remaining schools in the Atlantic region, formed the first 
nonresponse class while all remaining DoDEA Overseas schools were assigned to the second 
nonresponse class. 



7.3.2 Constructing the Final Nonresponse Adjustment Classes 

The objective in forming the nonresponse adjustment classes is to create as many classes 
as possible which are internally as homogeneous as possible, but such that the resulting 
nonresponse adjustment factors are not subject to large random variation. Consequently, all 
initial nonresponse adjustment classes deemed unstable were collapsed with suitable neighboring 
classes so that: (1) the combined class contained at least 6 schools, and (2) the resulting 
nonresponse adjustment factor did not exceed 1.35 (in a few cases a factor slightly in excess of 
1.35 was permitted). These limits had been used for the 1992 Trial State Assessment. 

Public Schools. For these schools, inadequate nonresponse adjustment classes were 
reinforced by collapsing adjacent levels of minority status (or median income level if minority 
information was missing). In doing so, different categories of urbanization were not mixed (to 
the extent possible). 

Nonpublic Schools. For nonpublic schools, excluding schools in District of Columbia and 
Guam, inadequate classes were reinforced by collapsing adjacent levels of metropolitan area 
status. Catholic and nonCatholic schools were kept apart to the extent possible, particularly 
when the only requirement to combine such schools was as a means of reducing the adjustment 
factors below 1.35. For schools in the District of Columbia, inadequate classes were collapsed 
over similar values of estimated grade enrollment. Catholic and nonCatholic schools were kept 
apart to the extent possible. For nonpublic schools in Guam, Catholic and nonCatholic schools 
were collapsed together in order to form a stable nonresponse adjustment class. 



7.3.3 School Nonresponse Adjustment Factors 



The school-level nonresponse adjustment factor for the ith school in the hth class was 
computed as: 

$$F_h^{(1)} = \frac{\sum_{i \in C_h} W_{hi}^{sch} E_{hi}}{\sum_{i \in C_h} W_{hi}^{sch} E_{hi} \delta_{hi}}$$

where 

$C_h$ = the subset of school records in class h; 

$W_{hi}^{sch}$ = the base weight of the ith school in class h; 

$E_{hi}$ = the QED grade enrollment for the ith school in class h; 

$\delta_{hi}$ = 1 if the ith school in adjustment class h participated in the 
assessments; and 

$\delta_{hi}$ = 0 otherwise. 



In the calculation of the above nonresponse adjustment factors, a school was said to have 
participated if: 

• It was selected for the sample from the QED frame or from the lists of new 
schools provided by participating school districts, and student assessment data 
were obtained from the school; or 

• The school participated as a substitute school and student assessment data were 
obtained (so that the substitute participated in place of the originally selected 
school). 

Both the numerator and denominator of the nonresponse adjustment factor contained only 
in-scope schools. 
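The factor and the stability limits from section 7.3.2 can be sketched together. The example takes the factor to be the enrollment-weighted total over all in-scope schools in a class divided by the same total over participating schools. Whether the limit of 6 schools counts all schools or only participating schools is not stated, so counting all in-scope schools is an assumption, and the data are hypothetical.

```python
def school_nonresponse_factor(schools):
    """schools: list of (base_weight, grade_enrollment, participated) for one class."""
    total = sum(w * e for w, e, _ in schools)
    responding = sum(w * e for w, e, participated in schools if participated)
    return total / responding

def is_stable(schools, min_schools=6, max_factor=1.35):
    """Apply the two limits described for the final adjustment classes."""
    return len(schools) >= min_schools and school_nonresponse_factor(schools) <= max_factor

# Hypothetical class: six in-scope schools, one of which did not participate.
klass = [(2.0, 100, True)] * 5 + [(2.0, 100, False)]
factor = school_nonresponse_factor(klass)
print(factor, is_stable(klass))
```

A class that fails either test would be collapsed with a suitable neighboring class and the factor recomputed for the combined class.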

The nonresponse-adjusted weight for the ith school in class h was computed as: 

$$W_{hi}^{adj} = F_h^{(1)} \times W_{hi}^{sch}$$




7.3.4 Student Nonresponse Adjustment Classes 

The initial student nonresponse classes were formed using the final school nonresponse 
classes, crossclassified by the quality control monitoring status (see section 3.5.4) and student 
age. Age was used to classify students into two groups (those born in September 1983 or earlier 
and those born in October 1983 or later). Following creation of the initial student nonresponse 
adjustment classes, all weak classes were identified for possible reinforcements. A class was 
considered to be unstable when any of the following conditions was true for the given class: 

• Number of responding eligible students was fewer than 20; 

• Nonresponse adjustment factor exceeded 2.0; or 

• Number of responding eligible students was fewer than 31 and nonresponse 
adjustment factor exceeded 1.5. 
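The three instability conditions translate directly into a predicate. The function name is hypothetical; the thresholds are those listed above.

```python
def is_unstable(n_respondents, factor):
    """True if a student nonresponse class meets any of the three conditions."""
    return (
        n_respondents < 20
        or factor > 2.0
        or (n_respondents < 31 and factor > 1.5)
    )

print(is_unstable(25, 1.6), is_unstable(31, 1.6))
```

The third condition tightens the limit on the adjustment factor for classes that are only moderately small, so a class of 25 respondents with a factor of 1.6 is unstable while a class of 31 with the same factor is not.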

All classes deemed unstable in the previous step were collapsed with other classes using 
the following rules: 

• Collapsed across monitoring status within all other classes; 

• If a resulting class still needed to be collapsed, then the previous collapsing was 
undone, and the class was instead collapsed across minority/income categories; and 

• If a resulting class still needed to be collapsed, it was further collapsed across the 
three fields — monitor status, urbanization level, and age category — in that order. 



7.3.5 Student Nonresponse Adjustments 

As described above, the student-level nonresponse adjustments for the assessed students 
were made within classes defined by the final school-level nonresponse adjustment classes, 
monitoring status of the school, and age group of the students. Subsequently, in each state, the 
final student weight for the jth student of the ith school in class k was then computed as: 

$$W_{ij}^{final} = W_i^{adj} \times W_{ij}^{within} \times F_k$$

where 

$W_i^{adj}$ = the nonresponse-adjusted school weight for school i; 

$W_{ij}^{within}$ = the within-school weight for the jth student in school i; 

and 

$$F_k = \frac{\sum_j W_i^{adj} W_{ij}^{within}}{\sum_j W_i^{adj} W_{ij}^{within} \delta_{ij}}$$

In the above formulation, the summation included all students, j, in the kth final (collapsed) 
nonresponse class. The indicator variable $\delta_{ij}$ had a value of 1 when the jth student in 
adjustment class k participated in the assessment; otherwise, $\delta_{ij}$ = 0. 

For excluded students the same basic procedures as described above for assessed 
students were used, except that the numerator and denominator contained excluded rather than 
assessed students, and monitoring status and student age group were not used to form the 
adjustment classes. Weights are provided for excluded students so as to estimate the size of this 
group and its population characteristics. Table 7-1 summarizes the unweighted and final 
weighted counts of assessed and excluded students for each state. 
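The student-level adjustment can be sketched for a single nonresponse class. The example takes the factor to be the sum of the products of adjusted school weight and within-school weight over all students in the class, divided by the same sum over participating students. Assigning a weight of zero to nonparticipants is a convenience of the sketch, and all values are hypothetical.

```python
def student_nonresponse_factor(students):
    """students: list of (adj_school_weight, within_weight, participated) for one class."""
    total = sum(ws * ww for ws, ww, _ in students)
    responding = sum(ws * ww for ws, ww, participated in students if participated)
    return total / responding

def final_weights(students):
    """Final weight for each student; nonparticipants get 0.0 in this sketch."""
    factor = student_nonresponse_factor(students)
    return [ws * ww * factor if participated else 0.0
            for ws, ww, participated in students]

# Hypothetical class of four sampled students, one of whom was absent.
students = [(10.0, 3.0, True), (10.0, 3.0, True), (10.0, 3.0, False), (12.0, 2.5, True)]
weights = final_weights(students)
print(weights)
```

The final weights of the participants sum to the same total as the pre-adjustment weights of all sampled students in the class, which is the property the adjustment is designed to preserve.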



7.4 CHARACTERISTICS OF NONRESPONDING SCHOOLS AND STUDENTS 

In the previous section procedures were described for adjusting the survey weights so as 
to reduce the potential bias of nonparticipation of sampled schools and students. To the extent 
that a nonresponding school or student is different from those respondents in the same 
nonresponse adjustment class, potential for nonresponse bias remains. 

In this section, we examine the potential for remaining nonresponse bias in two related 
ways. First we examine the weighted distributions, within each grade and state, of certain 
characteristics of schools and students, both for the full sample and for respondents only. This 
analysis is of necessity limited to those characteristics that are known for both respondents and 
nonrespondents, and hence cannot directly address the question of nonresponse bias. The 
approach taken does reflect the reduction in bias obtained through the use of nonresponse 
weighting adjustments. As such, it is more appropriate than a simple comparison of the 
characteristics of nonrespondents with those of respondents for each state. 

The second approach involves modeling the probability that a school is a nonrespondent, 
as a function of the nonresponse adjustment class within which the school is located, together 
with other school characteristics. This has been achieved using linear logistic regression models, 
with school response status as the dependent variable. By examining how much better one can 
predict school nonresponse using school characteristics, over and above using the membership of 
the nonresponse adjustment class to make this prediction, we can obtain some insight into the 
remaining potential for nonresponse bias. If these factors are substantially marginally predictive, 
there is a danger that significant nonresponse bias remains. These models have been developed 
for public schools in each of the seven states having public school participation (after 
substitution) of below 90 percent (with a participation rate prior to substitution in excess of 70 
percent). 






Table 7-1 

Unweighted and Final Weighted Counts of Assessed and Excluded Students by Jurisdiction 

                              Assessed                Excluded         Assessed and Excluded
Jurisdiction            Unweighted    Weighted  Unweighted    Weighted  Unweighted    Weighted

Alabama                      2,845      57,099         163       3,131       3,008      60,230
Arizona                      2,651      52,297         191       3,899       2,842      56,196
Arkansas                     2,689      32,550         167       1,978       2,856      34,528
California                   2,401     370,558         358      48,031       2,759     418,589
Colorado                     2,860      51,259         204       3,792       3,064      55,052
Connecticut                  2,868      38,888         237       3,250       3,105      42,138
Delaware                     2,783       9,239         146         503       2,929       9,742
DoDEA Overseas               2,413       8,350         108         399       2,521       8,749
District of Columbia         2,913       6,241         262         507       3,175       6,748
Florida                      2,933     168,380         302      17,846       3,235     186,225
Georgia                      2,983     102,798         164       5,548       3,147     108,346
Guam                         2,575       2,693         192         220       2,767       2,913
Hawaii                       3,147      15,474         141         714       3,288      16,188
Idaho                        2,692      17,922         140         947       2,832      18,869
Indiana                      2,874      75,590         153       3,787       3,027      79,377
Iowa                         3,086      39,125         133       1,752       3,219      40,877
Kentucky                     3,036      50,820         108       1,850       3,144      52,670
Louisiana                    3,170      64,663         165       3,572       3,335      68,236
Maine                        2,521      15,837         257       1,682       2,778      17,519
Maryland                     2,830      61,712         205       4,408       3,035      66,120
Massachusetts                2,819      65,486         237       5,118       3,056      70,604
Michigan                     2,142     112,908         122       6,954       2,264     119,862
Minnesota                    3,045      67,251         115       2,661       3,160      69,912
Mississippi                  2,918      39,288         169       2,340       3,087      41,628
Missouri                     3,042      68,884         152       3,211       3,194      72,094
Montana                      2,649      12,660          86         430       2,735      13,089
Nebraska                     2,606      26,458         103       1,063       2,709      27,521
New Hampshire                2,197      14,296         132         896       2,329      15,192
New Jersey                   2,888      93,268         155       5,208       3,043      98,476
New Mexico                   2,826      24,972         239       2,219       3,065      27,191
New York                     2,864     222,969         221      16,357       3,085     239,326
North Carolina               2,833      79,806         169       4,493       3,002      84,299
North Dakota                 2,797       9,847          65         221       2,862      10,068
Pennsylvania                 2,717     141,774         137       7,454       2,854     149,228
Rhode Island                 2,696      11,995         133         582       2,829      12,577
South Carolina               2,863      49,988         185       3,340       3,048      53,328
Tennessee                    1,998      57,433         112       3,725       2,110      61,157
Texas                        2,454     238,075         288      29,116       2,742     267,191
Utah                         2,733      31,893         138       1,712       2,871      33,605
Virginia                     2,870      79,774         214       5,592       3,084      85,366
Washington                   2,737      67,089         143       3,618       2,880      70,707
West Virginia                2,887      23,407         212       1,614       3,099      25,022
Wisconsin                    2,719      68,292         181       4,370       2,900      72,662
Wyoming                      2,699       7,398         116         330       2,815       7,728

Total                      121,269   2,856,703       7,620     220,439     128,889   3,077,142






7.4.1 Weighted Distributions of Schools Before and After School Nonresponse 

Table 7-2 shows the mean values of certain school characteristics for public schools, both 
before and after nonresponse. The means are weighted appropriately to reflect whether 
nonresponse adjustments have been applied (i.e., to respondents only) or not (to the full set of 
in-scope schools). The variables for which means are presented are the percentage of students 
in the school who are Black, the percentage who are Hispanic, the median income (1989) of the 
ZIP code area where the school is located, and the type of location. All variables were obtained 
from the sample frame, described in Chapter 3, with the exception of the type of location. This 
variable was derived for each sampled school using census data. The type of location variable 
has seven possible levels, which are defined in section 3.4.2. Although this variable is not 
interval-scaled, the mean value does give an indication of the degree of urbanization of the 
population represented by the school sample (lower values for type of location indicate a greater 
degree of urbanization). 

Two sets of means are presented for these four variables. The first set shows the 
weighted mean derived from the full sample of in-scope schools selected for reading; that is, 
respondents and nonrespondents (for which there was no participating substitute). The weight 
for each sampled school is the product of the school base weight and the grade enrollment. This 
weight therefore represents the number of students in the state represented by the selected 
school. The second set of means is derived from responding schools only, after school 
substitution. In this case the weight for each school is the product of the nonresponse-adjusted 
school weight and the grade enrollment, and therefore indicates the number of students in the 
state represented by the responding school. 
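The two sets of means can be sketched as follows. Each school's weight is its school weight multiplied by its grade enrollment, using the base weight for the full sample and the nonresponse-adjusted weight for respondents. The school records and field names are hypothetical.

```python
def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical in-scope schools; the third did not respond and had no substitute.
schools = [
    {"base": 2.0, "adj": 3.0, "enroll": 100, "pct_black": 10.0, "responded": True},
    {"base": 2.0, "adj": 3.0, "enroll": 100, "pct_black": 30.0, "responded": True},
    {"base": 2.0, "adj": None, "enroll": 100, "pct_black": 50.0, "responded": False},
]

# Full sample: school base weight times grade enrollment.
full_mean = weighted_mean(
    [s["pct_black"] for s in schools],
    [s["base"] * s["enroll"] for s in schools],
)

# Responding sample: nonresponse-adjusted weight times grade enrollment.
respondents = [s for s in schools if s["responded"]]
resp_mean = weighted_mean(
    [s["pct_black"] for s in respondents],
    [s["adj"] * s["enroll"] for s in respondents],
)
print(full_mean, resp_mean)
```

In this deliberately extreme example the nonresponding school differs sharply from the respondents in its class, so the two means diverge; a gap between the two columns of Table 7-2 would signal the same thing.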

Table 7-3 shows some of these same statistics for all schools combined, for those states 
where both the public school participation rate prior to substitution, and the nonpublic school 
participation rate prior to substitution, exceeded 70 percent. These are the states for which 
assessment results have been published for both public and nonpublic schools combined. Data 
on minority enrollment were not available for nonpublic schools, and so are not included in 
Table 7-3. 

The differences between these sets of means give an indication of the potential for 
nonresponse bias that has been introduced by nonresponding schools for which there was no 
participating substitute. For example, in Arkansas at grade 4 the mean percentage Black 
enrollment, estimated from the original sample of public schools, is 24.50 percent (Table 7-2). 
The estimate from the responding schools is 24.36 percent. Thus there may be a slight bias in 
the results for Arkansas because these two means differ. Note, however, that throughout these 
two tables the differences in the two sets of mean values are generally very slight, at least in 
absolute terms, suggesting that it is unlikely that substantial bias has been introduced by schools 
that did not participate and for which no substitute participated. Of course in a number of 
states (as indicated) there was no nonresponse at the school level, so that these sets of means 
are identical. Even in those jurisdictions where school nonresponse was relatively high (such as
Tennessee, Nebraska, New Hampshire, and Michigan), the absolute differences in means are 
slight. Occasionally the relative difference is large (the "Percent Black" in Wisconsin, for 
example), but these are for small population subgroups, and thus are very unlikely to have a 
large impact on results for the jurisdiction as a whole. 
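
The distinction drawn above between absolute and relative differences can be made concrete with a small sketch. The function names are illustrative only; the two pairs of values in the usage note are the Arkansas and Wisconsin "Percent Black" means quoted in this chapter.

```python
def absolute_difference(full_sample: float, responding: float) -> float:
    """Responding-sample mean minus full-sample mean, in the variable's own units."""
    return responding - full_sample


def relative_difference(full_sample: float, responding: float) -> float:
    """Absolute difference expressed as a fraction of the full-sample mean."""
    return (responding - full_sample) / full_sample
```

For Arkansas (24.50 versus 24.36) the absolute difference is -0.14 percentage points. For Wisconsin (9.20 versus 5.75) the absolute difference is -3.45 percentage points, which is a relative difference of about -37.5 percent because the subgroup is small.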




Table 7-2

Weighted Mean Values Derived from Sampled Public Schools

"Rate %" is the weighted participation rate after substitution (%). The first group of four
columns gives weighted mean values derived from the full sample; the second group gives
weighted mean values derived from the responding sample, with substitutes and school
nonresponse adjustment. "--" indicates that no value was reported.

                         |----------- Full sample ----------||------- Responding sample -------|
Jurisdiction       Rate %    Black Hispanic   Income Location    Black Hispanic   Income Location
Alabama             93.39    34.94     0.05  $23,860     4.37    35.16     0.05  $24,032     4.35
Arizona             99.04     3.49    23.22  $31,020     2.52     3.52    23.32  $31,058     2.51
Arkansas            94.09    24.50     0.23  $22,561     4.78    24.36     0.20  $22,648     4.81
California          90.52     6.79    36.85  $35,591     2.77     6.82    35.70  $35,766     2.75
Colorado           100.00     4.23    15.71  $32,485     3.38     4.19    15.92  $32,387     3.40
Connecticut         96.47    13.39    11.21  $44,520     3.34    12.87    10.81  $44,678     3.33
Delaware           100.00    28.53     1.99  $26,983     3.19    28.53     1.99  $26,983     3.19
DoDEA Overseas      99.25       --       --       --       --       --       --       --       --
Dist. of Columbia  100.00    88.78     5.06  $27,898     1.00    88.78     5.06  $27,898     1.00
Florida            100.00    25.10    11.32  $28,688     3.31    25.10    11.32  $28,688     3.31
Georgia             99.05    28.35     1.16  $30,537     4.57    28.39     1.16  $30,526     4.57
Guam               100.00     2.10     0.29       --       --     2.10     0.29       --       --
Hawaii              99.07     1.86     2.71  $35,436     4.34     1.88     2.72  $35,424     4.34
Idaho               91.45     0.18     5.39  $26,063     5.03     0.20     5.19  $26,091     4.97
Indiana             92.48    10.95     1.40  $27,947     3.87    10.96     1.40  $28,165     3.86
Iowa                99.05     2.70     1.05  $27,499     4.99     2.70     1.06  $27,404     5.01
Kentucky            96.16     9.64     0.10  $24,022     5.15     9.65     0.10  $23,782     5.16
Louisiana          100.00    43.87     1.32  $23,401     3.90    43.87     1.32  $23,401     3.90
Maine               96.99     0.09     0.50  $29,054     5.64     0.09     0.49  $29,191     5.65
Maryland            96.15    32.87     1.69  $40,496     2.67    32.94     1.76  $40,191     2.71
Massachusetts       97.02     8.65     7.91  $41,722     3.09     8.22     8.36  $41,567     3.11
Michigan            79.77    16.64     2.27  $33,009     3.72    17.37     2.26  $33,078     3.72
Minnesota           95.22     3.47     0.62  $33,478     4.11     3.57     0.64  $33,514     4.12
Mississippi         99.04    49.32     0.07  $21,249     5.37    49.46     0.07  $21,268     5.36
Missouri            98.40    10.06     1.29  $29,013     3.85     9.88     1.29  $29,107     3.86
Montana             88.58     0.29     1.39  $24,675     5.28     0.31     1.43  $24,682     5.28
Nebraska            77.25     4.02     2.59  $27,787     4.92     2.87     3.03  $26,479     5.09
New Hampshire       79.20     0.40     0.53  $39,829     4.52     0.38     0.51  $39,847     4.51
New Jersey          91.29    17.96    13.86  $42,647     3.22    18.83    13.95  $42,032     3.21
New Mexico         100.00     1.84    45.87  $24,273     4.27     1.84    45.87  $24,273     4.27
New York            90.57    20.72    15.60  $34,849     2.76    20.46    15.97  $34,351     2.75
North Carolina      99.05    29.69     0.70  $27,929     4.33    29.60     0.69  $28,036     4.33
North Dakota        91.19     0.62     0.51  $27,229     5.09     0.65     0.53  $27,203     5.04
Pennsylvania        83.69    14.91     2.20  $31,527     3.13    16.04     2.32  $31,238     3.13
Rhode Island        85^4      6.39     6.73  $31,585     2.94     6.80     6.67  $31,486     2.97
South Carolina      97.15    42.24     0.20  $26,373     4.55    42.17     0.20  $26,594     4.54
Tennessee           73.79    22.39     0.15  $25,243     3.63    24.23     0.16  $23,897     3.65
Texas               93.20    12.08    35.47  $27,869     2.74    11.69    35.49  $27,681     2.75
Utah               100.00     0.36     3.42  $32,643     3.95     0.36     3.42  $32,643     3.95
Virginia            99.05    21.27     1.77  $39,125     3.65    21.33     1.78  $39,124     3.65
Washington         100.00     4.12     5.13  $34,341     3.52     4.12     5.13  $34,341     3.52
West Virginia      100.00     2.93     0.06  $22,277     5.34     2.93     0.06  $22,277     5.34
Wisconsin           85.56     9.20     2.41  $32,677     3.86     5.75     2.62  $32,841     3.92
Wyoming             98.38     0.36     4.39  $31,446     5.19     0.37     4.45  $31,473     5.19







Table 7-3

Weighted Mean Values Derived from All Sampled Schools for Jurisdictions Achieving Minimal Required
Public- and Nonpublic-school Participation, Before Substitution

"Rate %" is the weighted participation rate after substitution (%). The first pair of columns
gives weighted mean values derived from the full sample; the second pair gives weighted mean
values derived from the responding sample, with substitutes and school nonresponse adjustment.
"--" indicates that no value was reported.

                         |-- Full sample --||- Responding sample -|
Jurisdiction       Rate %   Income Location      Income   Location
Alabama             96.77  $24,022     4.32     $24,198       4.29
Arkansas            94.70  $22,837     4.69     $22,934       4.71
Colorado            91.54  $32,449     3.36     $32,386       3.38
Connecticut         84.14  $44,818     3.28     $45,008       3.28
Delaware            79.92  $29,135     3.01     $29,489       3.06
Georgia             93.69  $30,800     4.50     $30,876       4.50
Guam                95.61       --       --          --         --
Hawaii              90.07  $35,845     4.17     $35,709       4.14
Iowa                99.68  $27,450     4.99     $27,364       5.00
Indiana             76.54  $28,167     3.80     $28,403       3.78
Kentucky            87.42  $24,619     4.93     $24,129       4.96
Louisiana           91.83  $23,760     3.73     $23,662       3.74
Massachusetts       99.01  $41,268     3.01     $41,129       3.03
Maine               99.00  $29,046     5.59     $29,178       5.60
Minnesota           97.15  $33,409     4.11     $33,445       4.12
Missouri            94.50  $29,694     3.66     $29,675       3.68
North Dakota        88.65  $27,184     5.10     $27,224       5.04
New Jersey          71.12  $42,522     3.21     $41,661       3.21
New Mexico         100.00  $24,199     4.24     $24,199       4.24
Pennsylvania        64.42  $31,893     3.04     $31,453       3.13
Rhode Island        80.37  $31,702     2.98     $31,565       3.00
Virginia            80.92  $39,252     3.61     $39,428       3.60
West Virginia       91.84  $22,405     5.25     $22,430       5.26






7.4.2 Characteristics of Schools Related to Response

In an effort to evaluate the possibility that substantial bias remains as a result of school
nonparticipation, following the use of nonresponse adjustments, a series of analyses was
conducted on the response statuses for public schools. These analyses were restricted to those
states with a participation rate below 90 percent (after substitution), since these are the states
where the potential for nonresponse bias is likely to be the greatest. We did not include those
states whose initial public school response rate was below 70 percent, since NAEP does not
report results for these states because of concern about nonresponse bias. Private schools were
omitted from these analyses because of the small sample sizes involved, which make it
difficult to assess whether a potential for bias exists.

The seven states investigated were the following (with the public school participation rate
shown in parentheses): Montana (89%), Nebraska (77%), New Hampshire (79%), Pennsylvania
(84%), Rhode Island (86%), Tennessee (74%), and Wisconsin (86%). The approach used was
to develop logistic regression models within each state, to predict the probability of participation
as a function of the nonresponse adjustment classes and other school characteristics. The aim
was to determine whether the response rates are significantly related to school characteristics,
after accounting for the effect of the nonresponse class. Thus dummy variables were created to
indicate nonresponse class membership, and an initial model was created which predicted the
probability of school participation as a function of nonresponse class.

If there are k nonresponse classes within a state, let

$$X_{ij} = \begin{cases} 1 & \text{if school } j \text{ is classified in nonresponse class } i \\ 0 & \text{otherwise} \end{cases} \qquad \text{for } i = 1, \ldots, (k-1).$$

Let $P_j$ denote the probability that school j is a participant, and let $L_j$ denote the logit of $P_j$. That
is,

$$L_j = \ln\left(\frac{P_j}{1 - P_j}\right).$$

The initial model fitted for each state was

$$L_j = a + \sum_{i=1}^{k-1} b_i X_{ij}. \tag{1}$$

The value of -2 log likelihood for this model, together with its degrees of freedom (k-1), 
are presented in Table 7-4, under the heading "Model with Only Nonresponse Classes". This 
constitutes a baseline, against which a second model was compared, as discussed below. Note 
that this model cannot be estimated if there are nonresponse classes in which all schools 
participated (so that no adjustments for nonresponse were made for schools in such a class). 
Even though this analysis was restricted to those states with relatively poor response, this 
occurred in a number of instances. When this happened, those (responding) schools in such 
classes were dropped from the analyses. Table 7-4 shows the proportion of the state public- 
school student population that is represented in the sample by schools from classes with less 
than 100 percent response. Thus in Nebraska, New Hampshire, and Tennessee, there was some 
nonresponse within every adjustment class, whereas for the other four states some portion of the 
population is not represented because schools were dropped from classes with no nonresponse. 
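
A minimal sketch of the class-only model follows, under two assumptions that the text does not confirm: that schools are treated as unweighted observations, and that the tabled statistic with (k-1) degrees of freedom is the improvement in -2 log likelihood over an intercept-only model. Because a model containing an indicator for every class reproduces each class's observed participation rate, its likelihood can be computed directly from class counts without an iterative fit. The function names are illustrative.

```python
import math


def logit(p: float) -> float:
    """Logit of a probability, L = ln(P / (1 - P)); undefined at P = 0 or P = 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is undefined when the participation rate is 0 or 1")
    return math.log(p / (1.0 - p))


def neg2_loglik(participants: int, total: int) -> float:
    """-2 log likelihood of one group of schools fitted with its own participation rate."""
    p = participants / total
    logit(p)  # raises if every school (or no school) in the group participated
    nonparticipants = total - participants
    return -2.0 * (participants * math.log(p) + nonparticipants * math.log(1.0 - p))


def class_model_chi_square(classes):
    """classes: list of (participants, total) pairs, one per nonresponse class.

    Returns (improvement in -2 log likelihood over an intercept-only model, k - 1).
    """
    all_participants = sum(p for p, _ in classes)
    all_schools = sum(t for _, t in classes)
    intercept_only = neg2_loglik(all_participants, all_schools)
    class_model = sum(neg2_loglik(p, t) for p, t in classes)
    return intercept_only - class_model, len(classes) - 1
```

The `ValueError` mirrors the situation noted above: a class in which all schools participated has an infinite logit, so the model cannot be estimated unless that class is dropped.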




Table 7-4

Results of Logistic Regression Analyses of School Nonresponse

Column key:
  Rate      School participation rate (%)
  Covered   Percent of population covered by models*
  -2LL, DF  Model with only nonresponse classes: -2 log likelihood and degrees of freedom
  Chg -2LL, Chg DF, Signif.
            Model with all variables: change in -2 log likelihood, change in degrees of
            freedom, and significance of the change
  Best model
            Significant variables retained in the best model**

Jurisdiction    Rate  Covered    -2LL  DF  Chg -2LL  Chg DF  Signif.          Best model
Montana           89     61.9   2.627   4     1.648       4  N.S.             None
Nebraska          77    100.0   0.618   3    25.701       5  p < .005         Y_4 - Median income (p = .0001);
                                                                              Y_1 - Percent Black (p = .0157)
New Hampshire     79    100.0   7.861   1     5.829       6  N.S.             None
Pennsylvania      84     78.8   0.291   3     7.813       5  N.S.             Z_1 - Type of location (p = .0323)
Rhode Island      86     90.4  21.741   7     4.829       7  N.S.             None
Tennessee         74    100.0   2.917   5    11.700       5  .025 < p < .05   Y_4 - Median income (p = .0132)
Wisconsin         86     64.8   0.911   4    21.534       6  p < .005         Y_1 - Percent Black (p = .0065)

*  For the remainder of the population (not covered by the models) there was 100 percent participation.

** Variables (in addition to the nonresponse classes) included in the best model obtained by a backwards stepwise procedure.







As an aside, these values for the log likelihood statistics show that in New Hampshire 
the response rates were significantly different between the two nonresponse classes, whereas the 
differences among the four classes in Nebraska were not statistically significant, nor among the 
six classes in Tennessee. This does not demonstrate that there was no benefit derived from the 
school nonresponse adjustments in Nebraska and Tennessee, as this analysis may be lacking in 
power, but it is suggestive of this possibility. 

Within each state a second logistic model was fitted to the data on public school
participation. In this model, the same indicator variables for nonresponse class were included,
together with additional variables available for participating and nonparticipating schools alike.
These variables were the percentage of Black students ($Y_1$), the percentage of Hispanic students
($Y_2$), the estimated grade 4 enrollment size of the school ($Y_3$), the median 1989 household
income of the zip code area in which the school was located ($Y_4$), and a set of indicator variables
for the type of location of the school. These type of location classes were the seven
categories of the NCES type of location variable, described in Chapter 3. However, states did
not each have six dummy variables for this classification, for three reasons. First, most states are
missing some of the categories. Second, it was necessary to collapse categories so that no
collapsed class had all schools as participants or all as nonparticipants. Finally,
since type of location classes were used in forming nonresponse adjustment classes, they are
frequently confounded with the indicator variables for these classes. Thus the number of
variables indicating type of location was 0 in Montana, 1 in Nebraska, 2 in New Hampshire, 1
in Pennsylvania, 3 in Rhode Island, 1 in Tennessee, and 2 in Wisconsin. These variables are
denoted as $Z_i$, for i from 1 to the number given above. Thus in New Hampshire there are two
variables, $Z_1$ and $Z_2$.

The model fitted in each state now was the following: 

$$L_j = A + \sum_{i=1}^{k-1} B_i X_{ij} + \sum_{i=1}^{4} C_i Y_{ij} + \sum_{i} D_i Z_{ij}. \tag{2}$$

The explanatory power of this model was compared with that of the initial one by 
examining the change in the value of -2 log likelihood, and assessing the statistical significance 
of this change. This evaluates whether, taken as a group, the Y and Z variables are significantly
related to the response probability, after accounting for nonresponse class. The results are 
shown in Table 7-4 under the heading "Model with All Variables". 
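
The significance assessment described above amounts to referring the change in -2 log likelihood to a chi-square distribution whose degrees of freedom equal the number of added variables. The sketch below assumes that standard likelihood-ratio reading (the text does not spell out the reference distribution) and computes the upper-tail probability using only the standard library, via the closed forms available for integer degrees of freedom. The function name is illustrative.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(chi-square with df degrees of freedom > x), integer df >= 1."""
    if x < 0 or df < 1:
        raise ValueError("x must be non-negative and df a positive integer")
    half = x / 2.0
    if df % 2 == 0:
        # Even df = 2m: exp(-x/2) * sum_{i=0}^{m-1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df = 2m + 1: erfc(sqrt(x/2)) + exp(-x/2) * sum_{i=1}^{m} (x/2)^(i - 1/2) / Gamma(i + 1/2)
    total = 0.0
    term = math.sqrt(half) / math.gamma(1.5)
    for i in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (i + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total
```

Applied to the changes reported in Table 7-4, `chi2_sf(11.700, 5)` falls between .025 and .05 (Tennessee), `chi2_sf(25.701, 5)` and `chi2_sf(21.534, 6)` fall below .005 (Nebraska and Wisconsin), and `chi2_sf(7.813, 5)` is well above .05 (Pennsylvania), in line with the significance column of that table.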

The table shows that in Montana, New Hampshire, Pennsylvania, and Rhode Island, we 
are unable to detect any effect of the additional variables. In the other three states, however, 
these additional variables significantly explain variation in response rates, not accounted for by 
nonresponse class. This is in spite of the fact that functions of these variables were used in 
defining nonresponse adjustment classes, as described earlier in this chapter, and in Chapter 3 
where the stratification for each state is described. 

The final step in the analysis was to attempt to isolate which of the additional variables
was able to contribute to the explanation of variation in response rate. This was done by fitting
a logistic regression model, using a backwards stepwise elimination procedure to develop a
parsimonious model. The starting point was model (2) above, and nonsignificant Y and Z
variables were removed until only the X variables and the significant Y and Z variables were retained.
The right-hand column of Table 7-4 shows the ensuing variable selection for each state, along
with the statistical significance of each retained variable.

This analysis shows that, for Nebraska, both the percent of Black students enrolled, and 
the median household income, were highly significant predictors, over and above nonresponse 
class. This occurs despite the fact that in Nebraska, minority enrollment was used in forming 
nonresponse adjustment classes within metropolitan areas (see Table 3-3). Median income 
classes were used to form nonresponse classes in nonmetropolitan areas, but evidently this did 
not capture the full explanatory power of this variable. The significance of these two variables is 
reflected in the results in Table 7-2. The full sample has a mean percent Black of 4.02 percent, 
whereas for the adjusted responding sample the mean percentage is 2.87 percent. The mean 
median household income for the full sample is $27,787, whereas for the respondents it is 
$26,247. Thus there is an indication that the final sample is somewhat underrepresentative of
schools with relatively high Black enrollment, and relatively high median household income.

For Pennsylvania, the single variable designating type of location is somewhat significant, 
even though this variable features prominently in the formation of nonresponse adjustment 
classes. This significance does not translate into the results of Table 7-2, since the mean value for
type of location for the full sample is 3.129, which is very close to the value of 3.133 for the 
respondents. 

For Tennessee, the median income variable is somewhat significant. This variable was 
used in forming nonresponse adjustment classes in Tennessee only in rural areas, as minority 
enrollment was used in other areas (see Table 3-3). The median income for the full sample is 
$25,243, while for the respondents it is $23,897. This indicates that the final sample is somewhat 
underrepresentative of schools with relatively high median income. 

For Wisconsin, the variable giving the percentage of Black enrollment is highly
significant. Minority enrollment was used in forming nonresponse classes only within large central
cities, and not elsewhere in the state (Table 3-3). This differential in nonresponse for schools
with different levels of Black enrollment is reflected in Table 7-2. This shows that the mean
percent Black for the full sample is 9.20 percent, but for the final sample it is only 5.75 percent.
This indicates that the sample is likely underrepresentative of schools with relatively high Black
enrollment.

These results indicate that on occasion there are differences between the original 
samples of schools, and those that participated, that are not fully removed by the process of 
creating nonresponse adjustments. Although these effects are not dramatic, they are statistically 
significant, and generally are reflected in noticeable differences in population characteristics
estimated from the respondents, compared to those obtained for the full sample. However, the 
evidence presented here does not permit valid speculation about the likely size or even direction 
of the bias in the states where these sample differences are noticeable. 



7.4.3 Weighted Distributions of Students Before and After Student Absenteeism

Table 7-5 shows, for the public schools in each state, the weighted sampled percentages 
of students by gender (male) and race/ethnicity (White, not Hispanic; Black, not Hispanic; 



Table 7-5 

Weighted Student Percentages Derived from Sampled Public Schools 



Columns, in order, for each jurisdiction: Weighted Student Participation (%); then Percent Male,
Percent White, Percent Black, Percent Hispanic, Percent IEP, Percent LEP, and Mean Age (Months)
as weighted estimates derived from the full sample; then the same seven measures as weighted
estimates derived from the assessed sample, with student nonresponse adjustment.

Alabama 


96.07 


50.72 


62.67 


28.92 


534 


5.89 


0.14 


119.27 


50.76 


63.69 


27.85 


531 


5.80 


0.10 


119.28 


Arizona 


94.27 


49.76 


58.32 


3.63 


2838 


6.44 


9.02 


119.23 


4932 


58.32 


3.63 


2838 


6.40 


9.15 


119.23 


Arkansas 


95.96 


50.06 


70.66 


2038 


6.08 


6.14 


0.30 


119.19 


49.90 


71.28 


1936 


6.14 


6.05 


6.40 


119.19 


California 


93.86 


50.86 


44.03 


7.01 


33.38 


5.81 


16.09 


116.42 


50.90 


4580 


6,75 


32.18 


5.14 


1633 


116.40 


Colorado 


94.25 


49.27 


68.16 


4.77 


20.85 


5.96 


2,43 


118.63 


49.78 


69.39 


4.49 


20.00 


5.86 


2.40 


118.63 


Connecticut


95.63 


49.66 


70.64 


11.80 


13.66 


8.25 


1.21 


116.96 


49.73 


72,70 


10.71 


13.02 


8.22 


1,25 


116.97 


Delaware 


9538 


49.46 


63.21 


23.81 


9.18 


9.16 


0.60 


llb.42 


49.22 


67,09 


20.71 


8.62 


9,13 


0.63 


116.40 


DoDEA Overseas 


9438 


49,84 


47.87 


19.10 


16.90 


3.79 


135 


117.00 


50.06 


47.87 


19.10 


16.90 


337 


137 


117.00 


Dist. of Columbia 


9432 


50.16 


5.33 


79.73 


12.15 


137 


1.75 


117.95 


49.81 


836 


76,74 


11.90 


131 


1.85 


117.91 


Florida


9196 


48.87 


57 08 


21 18 


18.78 


9.87 


336 


11935 


48.84 


59.20 


19.40 


18.02 


9.79 


335 


11938 


Georgia


95.45 


48.40 


55.83 


32.25 


830 


5.07 


1.02 


119.81 


48.41 


57.90 


30,42 


827 


5.19 


0.94 


119.80 


Guam 


9591 


50.04 


9.06 


333 


17.12 


0.48 


3.05 


11530 


50.66 


8.97 


334 


16.26 


030 


2.90 


115.49 


Hawaii 


95.45 


50.85 


17.02 


306 


20 22 


401 


335 


114.86 


5034 


18.33 


273 


18.72 


3.84 


3.66 


114.86 


Idaho 


%.12 


50.62 


81.09 


037 


13.30 


6.05 


1.79 


118.18 


50.16 


81.45 


035 


13.01 


5.92 


1.74 


118.17 


Indiana 


95.86 


49.47 


81.09 


10.06 


6.48 


6.10 


0.23 


12035 


49.18 


8034 


10.49 


634 


6.14 


0.21 


12034 


Iowa


9534 


50.95 


88.36 


2.73 


5.84 


6.74 


0.20 


119.78 


51.07 


8836 


2.60 


6.04 


630 


021 


119.79 


Kentucky


%.68 


51.21 


83.84 


9.73 


4.44 


4.02 


0.24 


118.84 


51.19 


84.10 


9.14 


4.70 


3.95 


0.25 


118.84 


Louisiana


%oe 


50.04 


50.86 


38 83 


7.47 


5.16 


036 


12032 


49^1 


5438 


35-27 


7.02 


5.03 


038 


120.45 


Maine 


94.32 


50.13 


92 04 


0.74 


4.33 


7.30 


0.17 


119.39 


50.12 


92.08 


0.75 


4.25 


723 


0.18 


119.39 


Maryland 


95 20 


52 82 


5719 


31 77 


601 


7.75 


0.71 


115.25 


52.11 


58.89 


30.22 


5.81 


7.76 


0.74 


11521 


Massachusetts 


95 43 


50.35 


76 94 


744 


10.69 


10.36 


1.43 


11765 


49.71 


76.89 


735 


1036 


10.11 


1.43 


117.65 


Michigan


M.87 


4922 


72.78 


15.15 


7.80 


334 


031 


118.24 


48.92 


72-78 


15.15 


7.80 


3.60 


0.46 


118.21 


Minnesota 


95.49 


5109 


83.93 


2.93 


7.98 


7.09 


0.83 


119 32 


50.89 


84.97 


2.83 


7.47 


7.29 


0.84 


119.32 


Mississippi 


96.68 


48.84 


45.98 


45 35 


6.70 


3.46 


0.15 


120.92 


48.60 


48.90 


42.69 


6.44 


337 


0.15 


120.90 


Missouri 


95 04 


51.22 


7536 


14 12 


638 


7.13 


0.03 


120 35 


51,45 


76.76 


1Z94 


6.78 


7.02 


0.03 


120.32 


Montana 


95.72 


50 73 


79.06 


053 


9.56 


7.45 


1.00 


120.32 


50.61 


78.45 


034 


9.39 


728 


1.01 


120.30 


Nebraska 


94 85 


50.78 


82 26 


3.74 


9.40 


11.75 


036 


119.15 


50.92 


91.28 


3.34 


8.81 


11.69 


034 


119.11 


New Hampshire 


9538 


4969 


91.28 


098 


436 


9.71 


0.17 


11932 


49.63 


60.74 


0.98 


436 


9.77 


0.17 


11931 


New Jersey


95.30 


48.80 


60.35 


16.08 


1708 


5.37 


132 


117.47 


48.62 


39 03 


15.41 


1739 


5.28 


138 


117.44 


New Mexico


94 68 


4788 


41 14 


290 


43.88 


8.64 


237 


118.93 


47.86 


5669 


301 


42.01 


839 


2.65 


118.91 







Table 7-5 (continued) 

Weighted Student Percentages Derived from Sampled Public Schools 



Columns, in order, for each jurisdiction: Weighted Student Participation (%); then Percent Male,
Percent White, Percent Black, Percent Hispanic, Percent IEP, Percent LEP, and Mean Age (Months)
as weighted estimates derived from the full sample; then the same seven measures as weighted
estimates derived from the assessed sample, with student nonresponse adjustment.


New York 


95.34 


50.05 


54.44 


20.63 


19.21 


4.38 


3.64 


115.98 


49.84 


65.49 


1959 


17.98 


4.39 


3.60 


115.97 


North Carolina 


95.83 


50.81 


65.49 


26.11 


3.74 


9.34 


0.49 


118.23 


50.72 


83.66 


26.11 


3.74 


9.15 


0.47 


118.24 


North Dakota 


96.63 


49.79 


88.07 


1.06 


5.49 


7.40 


0.46 


11955 


50.16 


84.92 


1.31 


6.13 


7.39 


0.47 


11957 


Pennsylvania 


94.13 


49.86 


76.47 


13.94 


6.32 


4.77 


0.79 


11851 


4951 


77.29 


1359 


5.80 


4.75 


0.82 


11850 


Rhode Island 


94.70 


49.15 


80.22 


5.83 


9.12 


8.28 


2.26 


117.14 


48.96 


80.90 


5.63 


9.01 


8.14 


2.36 


117.12 


South Carolina 


96.39 


50.78 


53.10 


36.72 


7.48 


6.69 


0.28 


118.22 


50.65 


54.75 


35.14 


7.30 


6.62 


0.29 


118.23 


Tennessee 


95 63 


49.48 


74.26 


19.73 


4.00 


7.37 


0.07 


119.38 


49.12 


74.26 


19.73 


4.00 


6.78 


0.08 


119.37 


Texas 


96.45 


49.86 


50.39 


12.06 


33.91 


6.89 


8.64 


119.78 


49.91 


50.39 


12.06 


33.91 


6.90 


8.74 


119.79 


Utah 


94.82 


50.74 


82.05 


0.65 


11.60 


6.60 


0.88 


118.00 


5054 


82.05 


0.65 


11.60 


6.37 


0.90 


117.98 


Virginia 


94.65 


50.16 


5957 


28.60 


7.18 


6.23 


0.78 


11756 


50.28 


60.83 


2752 


6.95 


6.17 


0.80 


11754 


Washington 


94.45 


51.93 


73.30 


4.94 


11.29 


7.72 


2.21 


118.80 


52.17 


73.30 


4.94 


11.29 


7.74 


2.30 


118.79 


West Virginia 


95 88 


5087 


90 75 


3.07 


3.91 


5.35 


0.11 


119.28 


50.60 


9054 


3.19 


3.94 


5.39 


0.11 


119.28 


Wisconsin 


96.34 


49.19 


83.45 


4.74 


7.30 


4.61 


1.67 


119.25 


48.97 


84.43 


452 


6.91 


454 


1.74 


119.24 


Wyoming 


95.92 


51.01 


82.01 


0.83 


12.21 


7.31 


0.31 


119.66 


50.85 


82.01 


0.83 


12.21 


7.31 


0.29 


119.68 






Hispanic), Individualized Education Program (IEP) Status, and Limited English Proficient
(LEP) Status for the full sample of students (after student exclusion) and for the assessed
sample. The mean student age in months is also presented on each basis. Table 7-6 shows
these results for all students, public and nonpublic, in those states having adequate school
response rates to permit reporting of combined results for public and nonpublic students.

The weight used for the full sample is the adjusted student base weight, defined in 
section 7.3.5. The weight for the assessed students is the final student weight, also defined in 
section 7.3.5. The difference between the estimates of the population subgroups is an estimate 
of the bias in estimating the size of the subgroup, resulting from student absenteeism. 

Care must be taken in interpreting these results, however. First, note that there is 
generally very little difference in the proportions estimated from the full sample and those 
estimated from the assessed students. While this is encouraging, it does not eliminate the 
possibility that bias exists, either within the state as a whole, or for results for gender and 
race/ethnicity subgroups, or for other subgroups. Second, on the other hand, where differences 
do exist they cannot be used to indicate the likely magnitude or direction of the bias with any 
reliability. For example, in Table 7-5, for New York the percentages of Black and Hispanic 
students in the full sample are respectively 20.62 and 19.21 percent. For assessed students, these 
percentages are 19.59 for Black students and 17.98 for Hispanic students. While these 
differences raise the possibility that some bias exists, it is not appropriate to speculate on the 
magnitude of this bias by considering the assessment results for Black and Hispanic students, in 
comparison to other students in the state. This is because the underrepresented Black and
Hispanic students may not be typical of students that were included in the sample, and similarly
those students within the same racial/ethnic groups who are disproportionately overrepresented
may not be typical either, since not all students within the same race/ethnicity group
receive the same student nonresponse adjustment.
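
As a worked illustration of the comparison described above, the sketch below computes the difference between the assessed-sample and full-sample subgroup percentages, which the text treats as an estimate of the bias in estimating subgroup size. The function name is illustrative, and the figures in the usage note are the New York percentages quoted in the preceding paragraph.

```python
def subgroup_size_bias(full_sample_pct: float, assessed_pct: float) -> float:
    """Estimated bias in subgroup size: assessed-sample percentage minus full-sample percentage."""
    return assessed_pct - full_sample_pct
```

For New York, `subgroup_size_bias(20.62, 19.59)` gives about -1.03 percentage points for Black students and `subgroup_size_bias(19.21, 17.98)` gives about -1.23 for Hispanic students. As the text cautions, these differences describe subgroup sizes only and say nothing reliable about the size or direction of bias in assessment results.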

One other feature to note is that, for assessed students, information as to the student’s 
gender and race/ethnicity is provided by the student, while for absent students this information 
is provided by the school. Evidence from past NAEP assessments (see, for example, Rust &
Johnson, 1992) indicates that there can be substantial discrepancies between those two sources,
especially with regard to classifying grade 4 students as Hispanic.



7.5 VARIATION IN WEIGHTS

After computation of full-sample weights, an analysis was conducted on the distribution 
of the final student weights in each state. The analysis was intended to (1) check that the 
various weight components had been derived properly in each state, and (2) examine the impact 
of the variability of the sample weights on the precision of the sample estimates, both for the 
state as a whole and for major subgroups within the state. 

The analysis was conducted by looking at the distribution of the final student weights for 
the assessed students in each state and for subgroups defined by age, sex, race, level of 
urbanization, and level of parents’ education. Two key aspects of the distribution were 
considered in each case: the coefficient of variation (equivalently, the relative variance) of the 






Table 7-6 

Weighted Student Percentages Derived from All Schools Sampled 



Columns, in order, for each jurisdiction: Weighted Student Participation (%); then Percent Male,
Percent White, Percent Black, Percent Hispanic, Percent IEP, Percent LEP, and Mean Age (Months)
as weighted estimates derived from the full sample; then the same seven measures as weighted
estimates derived from the assessed sample, with student nonresponse adjustment.


Alabama 


95.99 


5056 


62.35 


28.89 


5.76 


554 


0.13 


119.20 


5059 


63.32 


27.86 


5.74 


5.46 


0.09 


119.20 


Arkansas 


95.90 


50.08 


70.16 


20.69 


6.35 


5.97 


0.34 


119.11 


49.98 


70.75 


19.67 


6.42 


5.86 


056 


119.12 


Colorado 


94.24 


4954 


67.44 


4.81 


21.46 


5.68 


2.28 


118.64 


50.17 


68.67 


453 


20.60 


558 


2.26 


118.63 


Connecticut 


95.58 


4959 


70.36 


11.88 


13.86 


7.72 


1.28 


116.81 


4956 


72.42 


10.80 


13.24 


7.72 


1.32 


116.81 


Delaware 


95 84 


49.18 


63.12 


23.46 


9.47 


7.84 


0.49 


116.42 


48.93 


66.98 


20.43 


8.88 


7.78 


052 


116.40 


Florida


94.28 


49.33 


5655 


21.31 


19.10 


9.10 


3.27 


119.46 


49.25 


58.68 


1952 


18.33 


8.95 


3.27 


119.49 


Georgia 


95.52 


47.77 


55.49 


32.25 


8.75 


4.95 


0.95 


119.76 


47.74 


5756 


30.41 


852 


5.04 


0.88 


119.76 


Guam 


96.19 


50.23 


9.37 


3.64 


17.83 


0.41 


2.60 


115.33 


50.76 


9.26 


3.34 


16.91 


0.42 


2.48 


115.32 


Hawaii 


9554 


49.88 


1659 


3.13 


21.02 


359 


3.15 


114.94 


4957 


18.00 


2.79 


19.44 


3.45 


3.26 


114.94 


Indiana 


%.12 


49.71 


80.71 


10.15 


6.68 


5.72 


0.31 


12053 


49.48 


80.18 


1059 


6.74 


5.76 


0.25 


12051 


Iowa 


95.94 


50.92 


88.19 


2.66 


5.96 


6.37 


0.17 


119.80 


51.00 


88.39 


254 


6.16 


6.14 


0.18 


119.80 


Kentucky 


%.67 


50.77 


83.42 


9.91 


4.60 


3.73 


0.21 


118.79 


50.68 


83.78 


9.28 


4.80 


3.66 


0.23 


118.80 


Louisiana 


96 19 


48.74 


50.82 


38.43 


7.82 


4.36 


051 


120.04 


48.28 


5456 


34.87 


7.34 


4.26 


053 


119.99 


Maine 


94.33 


50 05 


91.72 


0.79 


456 


7.14 


0.17 


119.38 


50.00 


91.76 


0.79 


4.48 


7.08 


0.17 


119.39 


Massachusetts 


9550 


50.46 


76.71 


7.38 


10.87 


9.62 


1.29 


117.64 


49.86 


76.63 


752 


10.72 


9.41 


1.29 


117.64 


Minnesota 


9556 


50.45 


83.69 


2.94 


8.21 


6.46 


0.73 


119.34 


50.25 


84.72 


2.85 


7.68 


6.64 


0.74 


119.35 


Missouri 


94.87 


51.28 


75.09 


14.09 


6.90 


6.72 


0.05 


120.27 


51.44 


76.27 


12.94 


7.10 


6.65 


0.06 


120.24 


New Jersey 


95.37 


49.00 


60.39 


15.75 


17.28 


5.27 


1.38 


117.30 


48.91 


60.82 


15.13 


17.70 


5.10 


1.44 


117.27 


New Mexico 


94 46 


48.60 


41.11 


2.82 


44.28 


8.13 


3.10 


118.89 


48.60 


39.04 


2-97 


42-45 


8.08 


3.18 


118.86 


North Dakota 


%.27 


49.69 


87.85 


1.10 


5.64 


7.46 


1.30 


119.47 


50.00 


84.84 


157 


6.35 


7.37 


1.30 


119.47 


Pennsylvania 


94.17 


49.83 


76.23 


13.74 


6.62 


4.09 


0.99 


118.34 


4958 


77.06 


13.43 


6.08 


4.07 


1.00 


118.34 


RjModc Island 


94 89 


49.45 


79.85 


5.91 


9.38 


758 


121 


116.97 


49.32 


8050 


5.72 


9.27 


7.47 


251 


116.95 


Virginia 


94.71 


49.67 


59.17 


2853 


7.48 


5.87 


0.77 


11756 


49.83 


60.43 


27.45 


7.24 


5.81 


0.79 


11755 


West Virginia 


95.93 


50.99 


90.46 


3.12 


405 


5.14 


0.10 


119.23 


50.76 


90.30 


3.25 


4.08 


5.18 


0.10 


11924 






weight distribution; and the presence of outliers — that is, cases whose weights were several 
standard deviations away from the median weight. 



It was important to examine the coefficient of variation of the weights because a large
coefficient of variation reduces the effective size of the sample. Assuming that the variables of
interest for individual students are uncorrelated with the weights of the students, the sampling
variance of an estimated average or aggregate is approximately $1 + (C/100)^2$ times as great as
the corresponding sampling variance based on a self-weighting sample of the same size, where $C$
is the coefficient of variation of the weights expressed as a percent. Outliers, or cases with
extreme weights, were examined because the presence of such an outlier was an indication of
the possibility that an error was made in the weighting procedure, and because it was likely that
a few extreme cases would contribute substantially to the size of the coefficient of variation.



In most states, the coefficients of variation were 35 percent or less, both for the whole
sample and for all major subgroups. This means that the quantity $1 + (C/100)^2$ was generally
below 1.1, and the variation in sampling weights had little impact on the precision of sample
estimates.
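
The relationship between the coefficient of variation of the weights and the loss of precision can be illustrated with a short Python sketch. It assumes the inflation factor takes the form 1 + (C/100)^2 with C in percent, uses the population (divide-by-n) variance of the weights, and works on made-up weights; the function names are illustrative and are not part of the NAEP weighting software.

```python
import math

def weight_cv_percent(weights):
    """Coefficient of variation of the weights, expressed as a percent."""
    n = len(weights)
    mean = sum(weights) / n
    # Population (divide-by-n) variance is an assumption of this sketch.
    variance = sum((w - mean) ** 2 for w in weights) / n
    return 100.0 * math.sqrt(variance) / mean

def variance_inflation(weights):
    """Approximate variance ratio relative to a self-weighting sample."""
    c = weight_cv_percent(weights)
    return 1.0 + (c / 100.0) ** 2

def effective_sample_size(weights):
    """Nominal sample size divided by the inflation factor."""
    return len(weights) / variance_inflation(weights)

weights = [1.0, 3.0]  # illustrative weights, not NAEP data
print(weight_cv_percent(weights), variance_inflation(weights))
```

With equal weights the factor is exactly 1, so the sample behaves like a self-weighting sample; trimming a few extreme weights lowers C and therefore the factor.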



A few relatively large student weights were observed in one state. These extreme 
weights were for students in a school for which the grade enrollment available at the time of 
sample selection proved to be several-fold short of the actual enrollment. An evaluation was 
made of the impact of trimming these largest weights back to a level consistent with the largest 
remaining weights found in the state. Such a procedure produced an appreciable reduction in 
the size of the coefficient of variation for the weights in this state, and hence this trimming was 
implemented in that state. We judged that this procedure had minimal potential to introduce 
bias, while the reduction in the coefficient of variation of the weights gives rise to an appreciable 
decrease in sampling error for the state. 



7.6 CALCULATION OF REPLICATE WEIGHTS 

A replication method known as jackknife was used to estimate the variance of statistics 
derived from the full sample. The process of replication involves repeatedly selecting portions 
of the sample (replicates) and calculating the desired statistic (replicate estimates). The 
variability among the calculated replicate estimates is then used to obtain the variance of the 
full-sample estimate. 

In each state, replicates were formed in two steps. First, each school was assigned to 
one of a maximum of 62 replicate groups, each group containing at least one school. In the next 
step, a random subset of schools (or, in some cases, students within schools) in each replicate 
group was excluded. The remaining subset and all schools in the other replicate groups then 
constituted one of the 62 replicates. The process of forming these replicate groups, core to the 
process of variance estimation, is described below. 




7.6.1 Defining Replicate Groups and Forming Replicates for Variance Estimation 

Replicate groups were formed separately for public and nonpublic schools. Once 
replicate groups were formed for all schools, students were then assigned to their respective 
school replicate groups. 

Public Schools. These schools were sorted according to the state, monitoring status, and,
within monitoring status, the order in which they were selected from the sampling frame. The
schools were then grouped in pairs. Where there was an odd number of schools, the last
replicate group contained three schools instead of two. The pairing was done such that no
single pair contained schools with different monitoring status. In those states where the number
of pairs exceeded 62 (Montana and Nebraska), the pair numbering proceeded up to 62, and
then decreased back from 62 for the last few pairs.

Each of the certainty public schools (excluding those in Guam and the District of 
Columbia) was assigned to a single replicate group of its own. Here, schools were sorted by the 
estimated grade enrollment prior to group assignments. Again, depending on the state, a 
maximum of 62 certainty groups was formed. The group numbering resumed from the last 
group number used for the noncertainty schools if the total number of public school groups was 
less than 62. Otherwise, the numbering started from 62 down to the number needed for the last 
certainty public school. In the District of Columbia, which had only 117 certainty schools (no 
noncertainty schools), groups started at 1 and continued up to 62 and then back down to 8. 
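
The up-then-down numbering described above can be sketched in Python as follows. This is only an illustration of the numbering rule as stated in the text; the function name is hypothetical, and the sketch assumes no more than 124 units, since the text describes a single pass up to 62 and a single pass back down.

```python
def replicate_group_numbers(n_units, max_groups=62):
    """Number units 1..max_groups, then continue back down from max_groups."""
    if n_units > 2 * max_groups:
        # The text only describes one pass up and one pass down.
        raise ValueError("sketch covers at most 2 * max_groups units")
    numbers = []
    for index in range(n_units):
        if index < max_groups:
            numbers.append(index + 1)                         # 1, 2, ..., 62
        else:
            numbers.append(max_groups - (index - max_groups))  # 62, 61, ...
    return numbers

dc_groups = replicate_group_numbers(117)  # 117 certainty schools
print(dc_groups[0], dc_groups[61], dc_groups[62], dc_groups[-1])
```

For 117 units the numbering ends at 8, which agrees with the District of Columbia example in the text.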

The purpose of this scheme was to assign as many replicates to a state’s public schools as 
permitted by the design, to a maximum of 62. When more than 62 replicates were assigned, the 
procedure ensured that no subset of the replicate groups (pairs of noncertainty schools, 
individual certainty schools, or groups of these) was substantially larger than the other replicate 
groups. The aim was to maximize the degrees of freedom available for estimating variances for 
public-school data. 

A single replicate was formed by dropping one member of a given pair. This process 
was repeated successively across pairs, giving up to 62 replicates. 

Nonpublic Schools. Replicate groups for noncertainty nonpublic schools were formed in 
one of the two methods described below. If any of the following conditions was true for a given 
state, then the subsequent steps were taken to form replicate groups. Here, the numbering 
started at 62 down to the last needed number. 

Conditions for Method 1: 

• fewer than 11 nonpublic noncertainty schools;

• fewer than 2 Catholic noncertainty schools; or 

• fewer than 2 nonCatholic noncertainty schools. 






Steps for Method 1: 

• all schools were grouped into a single replicate group;

• schools were randomly sorted; and

• starting with the second school, replicates were formed by consecutively leaving
out one of the remaining n - 1 schools; each replicate included the first school.

When a given state did not match conditions of the first method, i.e., when all of the
following conditions were true, then the preceding steps were repeated separately for two
replicate groups, one consisting of Catholic schools and one consisting of nonCatholic schools.

Conditions for Method 2: 

• more than 10 nonpublic noncertainty schools; 

• more than 1 Catholic noncertainty school; and 

• more than 1 nonCatholic noncertainty school. 
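
The choice between the two methods reduces to a simple rule, sketched here in Python. The function name and argument names are illustrative only; the thresholds are the ones listed in the two sets of conditions.

```python
def nonpublic_replicate_method(n_noncertainty, n_catholic, n_noncatholic):
    """Return 1 or 2 according to the stated conditions for nonpublic schools."""
    # Method 1 applies if ANY of its three conditions holds.
    if n_noncertainty < 11 or n_catholic < 2 or n_noncatholic < 2:
        return 1
    # Otherwise all three Method 2 conditions hold.
    return 2

print(nonpublic_replicate_method(15, 6, 9))
```

Because the Method 2 conditions are the exact negation of the Method 1 conditions, every state falls under exactly one of the two methods.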

For states with certainty nonpublic schools (Delaware, District of Columbia, and Hawaii),
each school was assigned to a single group. Prior to this assignment, schools were sorted in
descending order of the estimated grade enrollment. The group numbering started at the last
number where the noncertainty nonpublic schools ended. A replicate was formed by randomly
deleting one half of the students in a certainty school from the sample. This was repeated for
each certainty school.

Again, the aim was to maximize the number of degrees of freedom for estimating
sampling errors for nonpublic schools (and indeed for public and nonpublic schools combined)
within the constraint of forming 62 replicate groups. Where a state had a significant
contribution from both Catholic and nonCatholic schools, we ensured that the sampling error
estimates reflected the stratification on this characteristic.

Guam. For Guam schools, the number of half-groups per school was determined by the
number of students. For public schools with fewer than 60 students, 60 to 119 students, and
more than 119 students, the number of half-groups per school was set to 2, 4, and 6,
respectively. For nonpublic schools, the limits were set to fewer than 70, 70 to 119, and more
than 119.



7.6.2 School-level Replicate Weights

As mentioned above, each replicate sample had to be reweighted to compensate for the
dropped unit(s) defining the replicate. This reweighting was done in two stages. At the
first stage, the ith school included in a particular replicate r was assigned a replicate-specific
school base weight defined as

$$W_{ri} = K_{ri} \times W_i ,$$

where $W_i$ is the full-sample base weight for school i, and, for public schools,

$K_{ri}$ =

    1.5 if school i was contained in a "pair" consisting of 3 units from which
        the complementary member was dropped to form replicate r,

    2   if school i was contained in a pair consisting of 2 units from which the
        complementary member was dropped to form replicate r,

    0   if school i was dropped to form replicate r, and

    1   otherwise.

For private schools, Method 1 (with $n$ schools):

$K_{ri}$ =

    $n/(n-1)$ if school i was not dropped in forming replicate r

    0         if school i was dropped to form replicate r

For private schools, Method 2 (with $n_1$ Catholic schools and $n_2$ nonCatholic schools):

$K_{ri}$ =

    $n_1/(n_1-1)$ if school i was Catholic, not dropped from replicate r, and
                  replicate r was formed by dropping a Catholic school

    1             if school i was Catholic and replicate r was formed by dropping a
                  nonCatholic school

    $n_2/(n_2-1)$ if school i was nonCatholic, not dropped from replicate r, and
                  replicate r was formed by dropping a nonCatholic school

    1             if school i was nonCatholic and replicate r was formed by dropping a
                  Catholic school

    0             if school i was dropped to form replicate r
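
The public-school factors can be illustrated with a small Python sketch. It treats the factor for a retained member of the affected group as size/(size - 1), which reproduces the stated values of 2 for a pair and 1.5 for a group of three; the school identifiers, the `group_of` mapping, and the equal base weights are hypothetical.

```python
def replicate_factor(school, dropped_school, group_of):
    """Factor applied to a school's base weight in one replicate."""
    if school == dropped_school:
        return 0.0                      # the dropped school gets weight zero
    dropped_group = group_of[dropped_school]
    if group_of[school] != dropped_group:
        return 1.0                      # schools in other groups are unchanged
    size = sum(1 for g in group_of.values() if g == dropped_group)
    if size < 2:
        raise ValueError("group must contain at least two schools")
    return size / (size - 1)            # 2 for a pair, 1.5 for three units

group_of = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 2}   # hypothetical schools
base_weight = {school: 10.0 for school in group_of}   # hypothetical weights
replicate_weight = {
    school: replicate_factor(school, "C", group_of) * base_weight[school]
    for school in group_of
}
print(replicate_weight)
```

When the schools in a group have equal base weights, as in this toy example, the replicate weights sum to the same total as the full-sample weights, which is the sense in which the retained schools compensate for the dropped one.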



Using the replicate-specific school base weights, $W_{ri}$, the school-level nonresponse
weighting adjustments were recalculated for each replicate r. That is, the school-level
nonresponse adjustment factor for schools in replicate r and adjustment class k was computed as

$$F_{rk} = \frac{\sum_{i \in C_k} W_{rki} \times E_{ki}}{\sum_{i \in C_k} W_{rki} \times E_{ki} \times \delta_{rki}}$$

where

$C_k$ = the subset of school records in adjustment class k;

$W_{rki}$ = the replicate-r base weight of the ith school in class k;

$E_{ki}$ = the QED grade enrollment for the ith school in class k.

In the above formulation, the indicator variable $\delta_{rki}$ had a nonzero value only when the ith school
in replicate r and adjustment class k participated in the assessment. The replicate-specific
nonresponse-adjusted school weight for the ith school in replicate r in class k was then computed
as

$$W^{adj}_{rki} = F_{rk} \times W_{rki} \times \delta_{rki} .$$
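
As a rough illustration of the school-level nonresponse adjustment within one adjustment class, the following Python sketch computes the ratio of enrollment-weighted totals over all schools to the same total over participating schools, and then applies it to the participating schools. The tuple layout and the numbers are invented for the example.

```python
def nonresponse_factor(schools):
    """schools: list of (replicate_base_weight, enrollment, participated)."""
    total = sum(w * e for w, e, _ in schools)
    responding = sum(w * e for w, e, took_part in schools if took_part)
    return total / responding

def adjusted_weights(schools):
    """Adjusted weight is factor * base weight for participants, 0 otherwise."""
    factor = nonresponse_factor(schools)
    return [factor * w if took_part else 0.0 for w, _, took_part in schools]

# One hypothetical adjustment class: two participants, one refusal.
adjustment_class = [(10.0, 50, True), (10.0, 50, False), (20.0, 25, True)]
print(nonresponse_factor(adjustment_class), adjusted_weights(adjustment_class))
```

The adjustment preserves the enrollment-weighted total of the class: the participating schools are weighted up so that they also represent the nonparticipating ones.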



7.6.3 Student-level Replicate Weights

The replicate-specific adjusted student base weights were calculated by multiplying the
replicate-specific adjusted school weights as described above by the corresponding within-school
student weights. That is, the adjusted student base weight for the jth student in adjustment class
k in replicate r was initially computed as

$$W_{rkij} = W^{adj}_{rki} \times W^{within}_{ij}$$

where

$W^{adj}_{rki}$ = the nonresponse-adjusted school weight for school i in school adjustment
class k and replicate r, and

$W^{within}_{ij}$ = the within-school weight for the jth student in school i.

The final replicate-specific student weights were then obtained by applying the student
nonresponse adjustment procedures to each set of replicate student weights. Let $F^{*}_{rk}$ denote the
student-level nonresponse adjustment factor for replicate r and adjustment class k. The final
replicate-r student weight for student j in school i in adjustment class k was calculated as:

$$W^{final}_{rkij} = W^{adj}_{rki} \times W^{within}_{ij} \times F^{*}_{rk} .$$






Finally, estimates of the variance of sample-based estimates were calculated as

$$\widehat{\mathrm{Var}}_{JK}(\hat{x}) = \sum_{r=1}^{62} (\hat{x}_r - \hat{x})^2 ,$$

where

$$\hat{x} = \sum_{i,j} W_{ij} \times x_{ij}$$

denotes an estimated total based on the full sample, and $\hat{x}_r$ denotes the corresponding estimate
based on replicate r, with 62 replicates. The standard error of an estimate $\hat{x}$ is estimated by
taking the square root of the estimated variance, $\widehat{\mathrm{Var}}_{JK}(\hat{x})$.
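
A minimal Python sketch of this variance computation for the paired case is given below. It assumes each replicate is formed by dropping one member of a pair and doubling the other, and for reproducibility it always drops the first member rather than a random one; the weighted school totals are invented, and only two pairs are used instead of 62.

```python
import math

def jackknife_variance(full_estimate, replicate_estimates):
    """Sum of squared deviations of replicate estimates from the full estimate."""
    return sum((x_r - full_estimate) ** 2 for x_r in replicate_estimates)

def paired_replicate_estimates(pairs):
    """pairs: list of (a, b) weighted totals for the two schools in each pair."""
    full = sum(a + b for a, b in pairs)
    # Replicate for a pair: drop a (factor 0) and double b (factor 2).
    replicates = [full - a + b for a, b in pairs]
    return full, replicates

pairs = [(10.0, 14.0), (20.0, 17.0)]   # hypothetical weighted school totals
full, replicates = paired_replicate_estimates(pairs)
variance = jackknife_variance(full, replicates)
print(full, replicates, variance, math.sqrt(variance))
```

Each replicate differs from the full-sample estimate by the within-pair difference, so the variance estimate is the sum of squared within-pair differences.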







Chapter 8 



THEORETICAL BACKGROUND AND PHILOSOPHY OF 
NAEP SCALING PROCEDURES 



Eugene G. Johnson, Robert J. Mislevy, and Neal Thomas 
Educational Testing Service 



8.1 OVERVIEW 

The primary method by which results from the Trial State Assessment are disseminated 
is scale-score reporting. With scaling methods, the performance of a sample of students in a 
subject area or subarea can be summarized on a single scale or a series of scales even when 
different students have been administered different items. This chapter presents an overview of 
the scaling methodologies employed in the analyses of the data from NAEP surveys in general 
and from the Trial State Assessment in reading in particular. Details of the scaling procedures 
specific to the Trial State Assessment are presented in Chapter 9. 



8.2 BACKGROUND

The basic information from an assessment consists of the responses of students to the 
items presented in the assessment. For NAEP, these items are constructed to measure 
performance on sets of objectives developed by nationally representative panels of learning area 
specialists, educators, and concerned citizens. Satisfying the objectives of the assessment and 
ensuring that the tasks selected to measure each goal cover a range of difficulty levels typically 
requires many items. For example, the Trial State Assessment in reading required 84 items at 
grade 4. To reduce student burden, each assessed student was presented only a fraction of the 
full pool of items through multiple matrix sampling procedures. 

The most direct manner of presenting the assessment results is to report separate 
statistics for each item. However, because of the vast amount of information, having separate 
results for each of the items in the assessment pool hinders the comparison of the general 
performance of subgroups of the population. Item-by-item reporting masks similarities in trends
and subgroup comparisons that are common across items. 

An obvious summary of performance across a collection of items is the average of the 
separate item scores. The advantage of averaging is that it tends to cancel out the effects of 
peculiarities in items that can affect item difficulty in unpredictable ways. Furthermore, 
averaging makes it possible to compare more easily the general performances of subpopulations.









Despite their advantages, there are a number of significant problems with average item 
scores. First, the interpretation of these results depends on the selection of the items; the 
selection of easy or difficult items could make student performance appear to be overly high or 
low. Second, the average score is related to the particular items comprising the average, so that 
direct comparisons in performance between subpopulations require that those subpopulations 
have been administered the same set of items. Third, because this approach limits comparisons
to average scores on specific sets of items, it provides no simple way to report trends over time 
when the item pool changes. Finally, direct estimates of parameters or quantities such as the 
proportion of students who would achieve a certain score across the items in the pool are not 
possible when every student is administered only a fraction of the item pool. While the mean 
average score across all items in the pool can be readily obtained (as the average of the 
individual item scores), statistics that provide distributional information, such as quantiles of the 
distribution of scores across the full set of items, cannot be readily obtained without additional 
assumptions. 

These limitations can be overcome by the use of response scaling methods. If several 
items require similar skills, the regularities observed in response patterns can often be exploited 
to characterize both respondents and items in terms of a relatively small number of variables. 
These variables include a respondent-specific variable, called proficiency, which quantifies a 
respondent’s tendency to answer items correctly (or, for multipoint items, to achieve a certain
score) and item-specific variables that indicate characteristics of the item such as its difficulty, 
effectiveness in distinguishing between individuals with different levels of proficiency, and the 
chances of a very low proficiency respondent correctly answering a multiple-choice item. (These 
variables are discussed in more detail in the next section). When combined through appropriate 
mathematical formulas, these variables capture the dominant features of the data. Furthermore, 
all students can be placed on a common scale, even though none of the respondents takes all of 
the items within the pool. Using the common scale, it becomes possible to discuss distributions 
of proficiency in a population or subpopulation and to estimate the relationships between 
proficiency and background variables. 

It is important to point out that any procedure of aggregation, from a simple average to
a complex multidimensional scaling model, highlights certain patterns at the expense of other
potentially interesting patterns that may reside within the data. Every item in a NAEP survey is
of interest and can provide useful information about what young Americans know and can do.
The choice of an aggregation procedure must be driven by a conception of just which patterns 
are salient for a particular purpose. 

The scaling for the Trial State Assessment in reading was carried out separately within 
the two reading content areas specified in the framework for grade 4 reading. This scaling 
within subareas was done because it was anticipated that different patterns of performance 
might exist for these essential subdivisions of the subject area. The two content area scales 
correspond with two purposes of reading — Reading for Literary Experience and Reading to 
Gain Information. By creating a separate scale for each of these content areas, potential 
differences in subpopulation performance between the content areas are preserved. 

The creation of a series of separate scales to describe reading performance does not
preclude the reporting of a single index of overall reading performance, that is, an overall
reading composite. A composite is computed as the weighted average of the two content area
scales, where the weights correspond to the relative importance given to each content area as
defined by the framework. The composite provides a global measure of performance within the
subject area, while the constituent content area scales allow the measurement of important
interactions within educationally relevant subdivisions of the subject area.
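
The composite is a plain weighted average, as the following Python sketch shows. The equal weights and the scale scores below are placeholders chosen for illustration; the actual weights are those defined by the framework and are not given in this section.

```python
def composite_score(scale_scores, weights):
    """Weighted average of content area scale scores."""
    total_weight = sum(weights.values())
    return sum(weights[area] * scale_scores[area] for area in weights) / total_weight

# Placeholder values, not results or framework weights.
scale_scores = {"literary_experience": 220.0, "gain_information": 210.0}
weights = {"literary_experience": 0.5, "gain_information": 0.5}
print(composite_score(scale_scores, weights))
```

Dividing by the sum of the weights keeps the composite on the same scale as its components even if the weights are supplied as percentages rather than proportions.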



8.3 SCALING METHODOLOGY

This section reviews the scaling models employed in the analyses of data from the Trial 
State Assessment in reading and the 1994 national reading assessment, and the multiple 
imputation or "plausible values" methodology that allows such models to be used with NAEP’s 
sparse item-sampling design. The reader is referred to Mislevy (1991) for an introduction to 
plausible values methods and a comparison with standard psychometric analyses, to Mislevy, 
Johnson and Muraki (1992) and Beaton and Johnson (1992) for additional information on how 
the models are used in NAEP, and to Rubin (1987) for the theoretical underpinnings of the 
approach. It should be noted that the imputation procedure used by NAEP is a mechanism for 
providing plausible values for proficiencies and not for filling in blank responses to background 
or cognitive variables. 

While the NAEP procedures were developed explicitly to handle the characteristics of 
NAEP data, they build on other research, and are paralleled by other researchers. See, for 
example, Dempster, Laird, and Rubin (1977); Little and Rubin (1983, 1987); Andersen (1980);
Engelen (1987); Hoijtink (1991); Laird (1978); Lindsey, Clogg, and Grego (1991); Zwinderman 
(1991); Tanner and Wong (1987); and Rubin (1987, 1991). 

The 84 reading items administered at grade 4 in the Trial State Assessment were also 
administered to fourth-grade students in the national reading assessment. However, because the 
administration procedures differed, the Trial State Assessment data were scaled independently 
from the national data. The national data also included results for students in grades 8 and 12. 
Details of the scaling of the Trial State Assessment and the subsequent linking to the results 
from the national reading assessment are provided in Chapter 9. 



8.3.1 The Scaling Models

Three distinct scaling models, depending on item type and scoring procedure, were used
in the analysis of the data from the Trial State Assessment. Each of the models is based on
item response theory (IRT; e.g., Lord, 1980). Each is a "latent variable" model, defined
separately for each of the scales, which expresses respondents’ tendencies to achieve certain scores
(such as correct/incorrect) on the items contributing to a scale as a function of a parameter that
is not directly observed, called proficiency on the scale.



A three-parameter logistic (3PL) model was used for the multiple-choice items (which
were scored correct/incorrect). The fundamental equation of the 3PL model is the probability
that a person whose proficiency on scale k is characterized by the unobservable variable $\theta_k$ will
respond correctly to item j:

$$P(x_j = 1 \mid \theta_k, a_j, b_j, c_j) = c_j + \frac{1 - c_j}{1 + \exp[-1.7 a_j (\theta_k - b_j)]} \equiv P_{j1}(\theta_k) \qquad (8.1)$$

where

$x_j$ is the response to item j, 1 if correct and 0 if not;

$a_j$ where $a_j > 0$, is the slope parameter of item j, characterizing its sensitivity
to proficiency;

$b_j$ is the threshold parameter of item j, characterizing its difficulty; and

$c_j$ where $0 \le c_j < 1$, is the lower asymptote parameter of item j, reflecting the
chances of students of very low proficiency selecting the correct option.

Further define the probability of an incorrect response to the item as

$$P_{j0}(\theta_k) = P(x_j = 0 \mid \theta_k, a_j, b_j, c_j) = 1 - P_{j1}(\theta_k) \qquad (8.2)$$

A two-parameter logistic (2PL) model was used for short constructed-response items,
which were scored correct or incorrect. The form of the 2PL model is the same as equations
(8.1) and (8.2) with the $c_j$ parameter fixed at zero.

Thirty-nine multiple-choice and 45 constructed-response items were presented in the
Trial State and grade 4 national assessments. Of the latter, 37 were short constructed-response
items, nine of which were scored on a three-point scale and 28 of which were dichotomously
scored. The remaining eight constructed-response items were scored on a five-point scale with
potential scores ranging from 0 to 4. Items that are scored on a multipoint scale are referred to
as polytomous items, in contrast with the multiple-choice and short constructed-response items,
which are scored correct/incorrect and referred to as dichotomous items.

The polytomous items were scaled using a generalized partial credit model (Muraki,
1992). The fundamental equation of this model is the probability that a person with proficiency
$\theta_k$ on scale k will have, for the jth item, a response $x_j$ that is scored in the ith of $m_j$ ordered score
categories:

$$P(x_j = i \mid \theta_k, a_j, b_j, d_{j,1}, \ldots, d_{j,m_j-1}) = \frac{\exp\left(\sum_{v=0}^{i} 1.7 a_j (\theta_k - b_j + d_{j,v})\right)}{\sum_{g=0}^{m_j-1} \exp\left(\sum_{v=0}^{g} 1.7 a_j (\theta_k - b_j + d_{j,v})\right)} \equiv P_{ji}(\theta_k) \qquad (8.3)$$

where

$m_j$ is the number of categories in the response to item j;

$x_j$ is the response to item j, with possibilities $0, 1, \ldots, m_j - 1$;

$a_j$ is the slope parameter;

$b_j$ is the item location parameter characterizing overall difficulty; and

$d_{j,i}$ is the category i threshold parameter (see below).

Indeterminacies in the parameters of the above model are resolved by setting $d_{j,0} = 0$ and
setting

$$\sum_{i=1}^{m_j-1} d_{j,i} = 0 .$$

Muraki (1992) points out that $b_j - d_{j,i}$ is the point on the $\theta_k$ scale at which the plots of $P_{j,i-1}(\theta_k)$
and $P_{ji}(\theta_k)$ intersect and so characterizes the point on the $\theta_k$ scale above which the category i
response to item j has the highest probability of incurring a change from response category i-1
to i.



When $m_j = 2$, so that there are two score categories (0, 1), it can be shown that $P_{ji}(\theta_k)$ of
equation 8.3 for i = 0, 1 corresponds respectively to $P_{j0}(\theta_k)$ and $P_{j1}(\theta_k)$ of the 2PL model (equations
8.1 and 8.2 with $c_j = 0$).
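
The item response functions above can be sketched in Python to make the relationship between the models concrete. This is an illustrative implementation of the standard 3PL and generalized partial credit forms with the scaling constant 1.7, not the NAEP BILOG/PARSCALE code; the function names and the parameter values used in the example are assumptions.

```python
import math

D = 1.7  # scaling constant used in the logistic models

def p_3pl(theta, a, b, c=0.0):
    """Probability of a correct response; c = 0 gives the 2PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def p_gpcm(theta, a, b, d):
    """Category probabilities for one polytomous item.

    d is the list of category thresholds d[0], ..., d[m-1] with d[0] = 0
    and the remaining thresholds summing to zero.
    """
    cumulative = []
    running = 0.0
    for d_v in d:
        running += D * a * (theta - b + d_v)
        cumulative.append(running)
    top = max(cumulative)                       # subtract max for stability
    numerators = [math.exp(z - top) for z in cumulative]
    total = sum(numerators)
    return [num / total for num in numerators]

# Two categories with d = [0, 0] should reproduce the 2PL probabilities.
two_category = p_gpcm(0.3, 1.2, -0.5, [0.0, 0.0])
print(two_category[1], p_3pl(0.3, 1.2, -0.5))
```

The example checks the two-category case numerically; the same function also shows that adjacent category curves cross at the location minus the category threshold.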

A typical assumption of item response theory is the conditional independence of the
response by an individual to a set of items, given the individual’s proficiency. That is,
conditional on the individual’s $\theta_k$, the joint probability of a particular response pattern
$\underline{x} = (x_1, \ldots, x_n)$ across a set of n items is simply the product of terms based on (8.1), (8.2), and
(8.3):

$$p(\underline{x} \mid \theta_k, \text{item parameters}) = \prod_{j=1}^{n} \prod_{i=0}^{m_j-1} P_{ji}(\theta_k)^{u_{ji}} \qquad (8.4)$$

where $P_{ji}(\theta_k)$ is of the form appropriate to the type of item (dichotomous or polytomous), $m_j$ is
taken equal to 2 for the dichotomously scored items, and $u_{ji}$ is an indicator variable defined by

$u_{ji}$ =

    1 if response $x_j$ was in category i
    0 otherwise.
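
Under conditional independence the likelihood of a response pattern is a product over the administered items. The Python sketch below illustrates this for dichotomous 2PL items only, with `None` marking an item that was not presented to the student; the item parameters and the data layout are invented for the example.

```python
import math

def category_probs_2pl(theta, a, b):
    """[P(x = 0), P(x = 1)] for a dichotomous item under the 2PL model."""
    p1 = 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))
    return [1.0 - p1, p1]

def pattern_likelihood(theta, items, responses):
    """Product of category probabilities over the items actually presented."""
    likelihood = 1.0
    for (a, b), x in zip(items, responses):
        if x is None:
            continue  # item not administered: contributes no factor
        likelihood *= category_probs_2pl(theta, a, b)[x]
    return likelihood

items = [(1.0, 0.0), (1.0, 0.0), (0.8, 1.0)]   # hypothetical (a, b) pairs
print(pattern_likelihood(0.0, items, [1, 0, None]))
```

Skipping the items a student never saw is what allows a likelihood for proficiency to be formed from any subset of calibrated items in a matrix-sampled design.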



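It is also typically assumed that response probabilities are conditionally independent of
background variables ($\underline{y}$), given $\theta_k$, or

$$p(\underline{x} \mid \theta_k, \text{item parameters}, \underline{y}) = p(\underline{x} \mid \theta_k, \text{item parameters}) \qquad (8.5)$$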



After $\underline{x}$ has been observed, equation 8.4 can be viewed as a likelihood function, and
provides a basis for inference about $\theta_k$ or about item parameters. Estimates of item parameters
were obtained by the NAEP BILOG/PARSCALE program, which combines Mislevy and Bock’s
(1982) BILOG and Muraki and Bock’s (1991) PARSCALE computer programs, and which
concurrently estimates parameters for all items (dichotomous and polytomous). The item
parameters are then treated as known in subsequent calculations. The parameters of the items
constituting each of the separate scales were estimated independently of the parameters of the
other scales. Once items have been calibrated in this manner, a likelihood function for the scale
proficiency $\theta_k$ is induced by a vector of responses to any subset of calibrated items, thus allowing
$\theta_k$-based inferences from matrix samples.

In all NAEP IRT analyses, missing responses at the end of each block of items a student 
was administered were considered "not-reached," and treated as if they had not been presented 
to the respondent. Missing responses to dichotomous items before the last observed response in 
a block were considered intentional omissions, and treated as fractionally correct at the value of 
the reciprocal of the number of response alternatives. These conventions are discussed by 
Mislevy and Wu (1988). With regard to the handling of not-reached items, Mislevy and Wu 
found that ignoring not-reached items introduces slight biases into item parameter estimation to 
the degree that not-reached items are present and speed is correlated with ability. With regard 
to omissions, they found that the method described above provides consistent limited- 
information likelihood estimates of item and ability parameters under the assumption that 
respondents omit only if they can do no better than responding randomly. 
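
The conventions for missing responses within a block can be sketched in Python as follows. The sketch assumes responses are stored in presentation order with `None` for a missing response; the function name and the example block are illustrative.

```python
def recode_block(responses, n_alternatives):
    """Recode one block of dichotomous responses.

    responses: list of 1, 0, or None (missing), in presentation order.
    n_alternatives: number of response alternatives for each item.
    Missing responses after the last observed response are "not reached"
    and stay None; earlier missing responses are omissions scored 1/k.
    """
    observed = [i for i, r in enumerate(responses) if r is not None]
    last_observed = observed[-1] if observed else -1
    recoded = []
    for i, (r, k) in enumerate(zip(responses, n_alternatives)):
        if r is not None:
            recoded.append(float(r))
        elif i < last_observed:
            recoded.append(1.0 / k)   # intentional omission
        else:
            recoded.append(None)      # not reached: treated as not presented
    return recoded

print(recode_block([1, None, 0, None, None], [4, 4, 4, 4, 4]))
```

In this example the second item is an omission scored at 0.25, while the last two items are not reached and are excluded from the likelihood.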

Although the IRT models are employed in NAEP only to summarize performance, a 
number of checks are made to detect serious violations of the assumptions underlying the 
models (such as conditional independence). When warranted, remedial efforts are made to 
mitigate the effects of such violations on inferences. These checks include comparisons of 
empirical and theoretical item response functions to identify items for which the IRT model may 
provide a poor fit to the data. 



Scaling areas in NAEP are determined a priori by grouping items into content areas for 
which overall performance is deemed to be of interest, as defined by the frameworks developed 
by the National Assessment Governing Board. A proficiency scale $\theta_k$ is defined a priori by the
collection of items representing that scale. What is important, therefore, is that the models 
capture salient information in the response data to effectively summarize the overall 
performance on the content area of the populations and subpopulations being assessed in the 
content area. NAEP routinely conducts differential item functioning (DIF) analyses to guard 
against potential biases in making subpopulation comparisons based on the proficiency 
distributions. 

The local independence assumption embodied in equation 8.4 implies that item response 
probabilities depend only on $\theta_k$ and the specified item parameters, and not on the position of the
item in the booklet, the content of items around an item of interest, or the test-administration 
and timing conditions. However, these effects are certainly present in any application. The 
practical question is whether inferences based on the IRT probabilities obtained via 8.4 are 
robust with respect to the ideal assumptions underlying the IRT model. Our experience with the 
1986 NAEP reading anomaly (Beaton & Zwick, 1990) has shown that for measuring small 
changes over time, changes in item context and speededness conditions can lead to unacceptably 
large random error components. These can be avoided by presenting items used to measure 
change in identical test forms, with identical timings and administration conditions. Thus, we do 
not maintain that the item parameter estimates obtained in any particular booklet configuration 
are appropriate for other conceivable configurations. Rather, we assume that the parameter 
estimates are context-bound. (For this reason, we prefer common population equating to 
common item equating whenever equivalent random samples are available for linking.) This is 
the reason that the data from the Trial State Assessment were calibrated separately from the
data from the national NAEP: since the administration procedures differed somewhat between
the Trial State Assessment and the national NAEP, the values of the item parameters could be
different. Chapter 9 provides details on the procedures used to link the results of the 1994 Trial
State Assessment to those of the 1994 national assessment.



8.3.2 An Overview of Plausible Values Methodology

Item response theory was developed in the context of measuring individual examinees'
abilities. In that setting, each individual is administered enough items (often 60 or more) to
permit precise estimation of his or her $\theta$, as a maximum likelihood estimate $\hat{\theta}$, for example.
Because the uncertainty associated with each $\hat{\theta}$ is negligible, the distribution of $\theta$, or the joint
distribution of $\theta$ with other variables, can then be approximated using individuals' $\hat{\theta}$ values as if
they were $\theta$ values.

This approach breaks down in the assessment setting when, in order to provide broader
content coverage in limited testing time, each respondent is administered relatively few items in
a scaling area. The problem is that the uncertainty associated with individual $\hat{\theta}$s is too large to
ignore, and the features of the $\hat{\theta}$ distribution can be seriously biased as estimates of the $\theta$
distribution. (The failure of this approach was verified in early analyses of the 1984 NAEP
reading survey; see Wingersky, Kaplan, & Beaton, 1987.) Plausible values were developed as a
way to estimate key population features consistently, and approximate others no worse than
standard IRT procedures would. A detailed development of plausible values methodology is
given in Mislevy (1991). Along with theoretical justifications, that paper presents comparisons
with standard procedures, discussions of biases that arise in some secondary analyses, and
numerical examples.

The following provides a brief overview of the plausible values approach, focusing on its 
implementation in the Trial State Assessment analyses. 

Let y represent the responses of all sampled examinees to background and attitude 
questions, along with design variables such as school membership, and let θ represent the vector 
of scale proficiency values. If θ were known for all sampled examinees, it would be possible to 
compute a statistic t(θ,y), such as a scale or composite subpopulation sample mean, a sample 
percentile point, or a sample regression coefficient, to estimate a corresponding population 
quantity T. A function U(θ,y), for example a jackknife estimate, would be used to gauge sampling 
uncertainty, as the variance of t around T in repeated samples from the population. 

Because the scaling models are latent variable models, however, θ values are not 
observed even for sampled students. To overcome this problem, we follow Rubin (1987) by 
considering θ as "missing data" and approximate t(θ,y) by its expectation given (x,y), the data that 
actually were observed, as follows: 

$$t^*(x,y) = E[\,t(\theta,y) \mid x,y\,] = \int t(\theta,y)\, p(\theta \mid x,y)\, d\theta . \qquad (8.6)$$

It is possible to approximate t* using random draws from the conditional distribution of 
the scale proficiencies given the item responses x_i, background variables y_i, and model 
parameters for sampled student i. These values are referred to as imputations in the sampling 
literature, and plausible values in NAEP. The value of θ for any respondent that would enter 
into the computation of t is thus replaced by a randomly selected value from the respondent's 
conditional distribution. Rubin (1987) proposes that this process be carried out several 
times (multiple imputations) so that the uncertainty associated with imputation can be 
quantified. The average of the results of, for example, M estimates of t, each computed from a 
different set of plausible values, is a Monte Carlo approximation of (8.6); the variance among 
them, B, reflects uncertainty due to not observing θ, and must be added to the estimated 
expectation of U(θ,y), which reflects uncertainty due to testing only a sample of students from 
the population. Section 8.5 explains how plausible values are used in subsequent analyses. 

It cannot be emphasized too strongly that plausible values are not test scores for 
individuals in the usual sense. Plausible values are offered only as intermediary computations 
for calculating integrals of the form of equation 8.6, in order to estimate population 
characteristics. When the underlying model is correctly specified, plausible values will provide 
consistent estimates of population characteristics, even though they are not generally unbiased 
estimates of the proficiencies of the individuals with whom they are associated. The key idea 
lies in a contrast between plausible values and the more familiar θ estimates of educational 
measurement that are in some sense optimal for each examinee (e.g., maximum likelihood 
estimates, which are consistent estimates of an examinee's θ, and Bayes estimates, which provide 
minimum mean-squared errors with respect to a reference population). Point estimates that are 
optimal for individual examinees have distributions that can produce decidedly nonoptimal 
(specifically, inconsistent) estimates of population characteristics (Little & Rubin, 1983). Plausible 
values, on the other hand, are constructed explicitly to provide consistent estimates of 
population effects. For further discussion see Mislevy, Beaton, Kaplan, and Sheehan (1992). 
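
The contrast between point estimates and plausible values can be illustrated with a small simulation. The sketch below is not part of the NAEP analyses. It assumes a toy model in which θ is standard normal and each examinee provides a single noisy measurement x = θ + e, so that the maximum likelihood estimate, the Bayes (posterior mean) estimate, and a plausible value all have closed forms. The function name and all numerical settings are hypothetical.

```python
import random
import statistics

def simulate_population_variance(n=20000, error_var=1.0, seed=1):
    """Compare the variance of three kinds of per-examinee values.

    Toy model (an assumption, not the NAEP IRT model):
    theta ~ N(0, 1) and x = theta + e with e ~ N(0, error_var).
    """
    rng = random.Random(seed)
    shrink = 1.0 / (1.0 + error_var)          # posterior mean is shrink * x
    post_sd = (error_var * shrink) ** 0.5     # posterior standard deviation
    mle, bayes, plausible = [], [], []
    for _ in range(n):
        theta = rng.gauss(0.0, 1.0)
        x = theta + rng.gauss(0.0, error_var ** 0.5)
        mle.append(x)                          # maximum likelihood estimate
        bayes.append(shrink * x)               # Bayes (posterior mean) estimate
        # A plausible value is a random draw from the posterior distribution.
        plausible.append(shrink * x + rng.gauss(0.0, post_sd))
    return {
        "mle": statistics.pvariance(mle),
        "bayes": statistics.pvariance(bayes),
        "plausible": statistics.pvariance(plausible),
    }

if __name__ == "__main__":
    for name, value in simulate_population_variance().items():
        print(f"{name:10s} variance = {value:.3f}")
```

Under these assumptions the true population variance of θ is 1. The variance of the maximum likelihood estimates overstates it (about 1 + error_var), the variance of the Bayes estimates understates it (about 1 / (1 + error_var)), and the variance of the plausible values is close to 1.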



8.3.3 Computing Plausible Values in IRT-based Scales 

Plausible values for each respondent i are drawn from the conditional distribution 
p(θ_i | x_i, y_i, Γ, Σ), where Γ and Σ are regression model parameters defined in this subsection. This 
subsection describes how, in IRT-based scales, these conditional distributions are characterized, 
and how the draws are taken. An application of Bayes' theorem with the IRT assumption of 
conditional independence produces 

$$p(\theta_i \mid x_i, y_i, \Gamma, \Sigma) \propto P(x_i \mid \theta_i, y_i, \Gamma, \Sigma)\, p(\theta_i \mid y_i, \Gamma, \Sigma) = P(x_i \mid \theta_i)\, p(\theta_i \mid y_i, \Gamma, \Sigma) , \qquad (8.7)$$

where, for vector-valued θ_i, P(x_i | θ_i) is the product over scales of the independent likelihoods 
induced by responses to items within each scale, and p(θ_i | y_i, Γ, Σ) is the multivariate (and 
generally nonindependent) joint density of proficiencies for the scales, conditional on the 
observed value y_i of background responses, and the parameters Γ and Σ. The scales are 
determined by the item parameter estimates that constrain the population mean to zero and 
standard deviation to one. The item parameter estimates are fixed and regarded as population 
values in the computation described in this subsection. 

In the analyses of the data from the Trial State Assessment and the data from the 
national reading assessment, a normal (Gaussian) form was assumed for p(θ_i | y_i, Γ, Σ), with a 
common variance-covariance matrix, Σ, and with a mean given by a linear model with slope 
parameters, Γ, based on the first 134 to 200 principal components of 482 selected main effects 
and two-way interactions of the complete vector of background variables. The included principal 
components will be referred to as the conditioning variables, and will be denoted y^c. (The 
complete set of original background variables used in the Trial State Assessment reading 
analyses is listed in Appendix C.) The following model was fit to the data within each state: 

$$\theta = \Gamma' y^c + \varepsilon , \qquad (8.8)$$

where ε is multivariately normally distributed with mean zero and variance-covariance matrix Σ. 
The number of principal components of the conditioning variables used for each state was 
sufficient to account for 90 percent of the total variance of the full set of conditioning variables 
(after standardizing each variable). As in regression analysis, Γ is a matrix each of whose 
columns is the effects for one scale and Σ is the variance-covariance matrix of residuals between 
scales. By fitting the model (8.8) separately within each state, interactions between each state 
and the conditioning variables are automatically included in the conditional joint density of scale 
proficiencies. 
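
The 90 percent rule for choosing the number of principal components can be sketched as follows. This is only an illustration: the function name is hypothetical, and the eigenvalues are assumed to come from a principal components analysis of the standardized conditioning variables that has already been carried out elsewhere.

```python
def n_components_for_variance(eigenvalues, target=0.90):
    """Smallest K such that the K largest eigenvalues account for at least
    `target` of the total variance (the sum of all eigenvalues)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)

if __name__ == "__main__":
    # Hypothetical eigenvalues for five standardized variables (they sum to 5).
    example = [2.5, 1.5, 0.75, 0.15, 0.10]
    print(n_components_for_variance(example))
```

With the hypothetical eigenvalues shown, the first two components capture 80 percent of the variance and the first three capture 95 percent, so K = 3 is returned. In the Trial State Assessment the same rule yielded between 134 and 200 components, depending on the state.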

Maximum likelihood estimates of Γ and Σ, denoted by Γ̂ and Σ̂, are obtained from 
Sheehan's (1985) MGROUP computer program using the EM algorithm described in Mislevy 
(1985). The EM algorithm requires the computation of the mean, θ̄_i, and variance, Σ_i^p, of the 
posterior distribution in (8.7). These moments are computed using higher order asymptotic 
corrections (Thomas, 1992). 

After completion of the EM algorithm, the plausible values are drawn in a three-step 
process from the joint distribution of the values of Γ for all sampled respondents. First, a value 
of Γ is drawn from a normal approximation to P(Γ, Σ | x_i, y_i) that fixes Σ at the value Σ̂ (Thomas, 
1992). Second, conditional on the generated value of Γ (and the fixed value of Σ = Σ̂), the 
mean, θ̄_i, and variance, Σ_i^p, of the posterior distribution in equation 8.7 (i.e., p(θ_i | x_i, y_i, Γ, Σ)) are 
computed using the same methods applied in the EM algorithm. In the third step, the θ_i are 
drawn independently from a multivariate normal distribution with mean θ̄_i and variance Σ_i^p, 
approximating the distribution in (8.7). These three steps are repeated five times, producing five 
imputations of θ_i for each sampled respondent. 
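
The three-step draw can be sketched for a single respondent on a single scale. The code below is a simplified illustration under several assumptions that are not taken from the report: a two-parameter logistic item model, independent normal draws for the elements of Γ with hypothetical standard errors, and posterior moments computed on a grid rather than with the higher order asymptotic corrections of Thomas (1992). All function and parameter names are hypothetical.

```python
import math
import random

def two_pl(theta, a, b):
    """Two-parameter logistic item response function, P(correct | theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def posterior_moments(responses, items, prior_mean, prior_sd, n_grid=81):
    """Mean and variance of the posterior in (8.7) for one scale, on a grid
    spanning five prior standard deviations either side of the prior mean."""
    grid = [prior_mean + prior_sd * (-5.0 + 10.0 * g / (n_grid - 1))
            for g in range(n_grid)]
    weights = []
    for theta in grid:
        log_w = -0.5 * ((theta - prior_mean) / prior_sd) ** 2  # normal prior kernel
        for x, (a, b) in zip(responses, items):
            p = two_pl(theta, a, b)
            log_w += math.log(p if x == 1 else 1.0 - p)       # item likelihood
        weights.append(math.exp(log_w))
    total = sum(weights)
    mean = sum(w * t for w, t in zip(weights, grid)) / total
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, grid)) / total
    return mean, var

def draw_plausible_values(responses, items, y_c, gamma_hat, gamma_se,
                          sigma_hat, m=5, rng=None):
    """Return m plausible values for one respondent (m = 5 in the report)."""
    rng = rng or random.Random(0)
    values = []
    for _ in range(m):
        # Step 1: draw Gamma from a normal approximation, Sigma fixed at sigma_hat.
        gamma = [rng.gauss(g, se) for g, se in zip(gamma_hat, gamma_se)]
        # Step 2: posterior mean and variance conditional on the drawn Gamma.
        prior_mean = sum(g * y for g, y in zip(gamma, y_c))
        mean, var = posterior_moments(responses, items, prior_mean, sigma_hat)
        # Step 3: draw theta from a normal distribution with those moments.
        values.append(rng.gauss(mean, math.sqrt(var)))
    return values

if __name__ == "__main__":
    items = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.5), (1.0, 1.0)]  # hypothetical (a, b)
    print(draw_plausible_values([1, 0, 1, 1], items, y_c=[1.0, 0.3],
                                gamma_hat=[0.1, 0.5], gamma_se=[0.05, 0.05],
                                sigma_hat=1.0))
```

Redrawing Γ for each of the five imputations lets the plausible values reflect uncertainty in the regression parameters as well as in θ itself.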



8.4 NAGB ACHIEVEMENT LEVELS 

Since its beginning, a goal of NAEP has been to inform the public about what students in 
American schools know and can do. While the NAEP scales provide information about the 
distributions of proficiency for the various subpopulations, they do not directly provide 
information about the meaning of various points on the scale. Traditionally, meaning has been 
attached to educational scales by norm-referencing — that is, by comparing students at a 
particular scale level to other students. Beginning in 1990, NAEP reports have also presented 
data using achievement levels. The reading achievement levels were developed and adopted by 
the National Assessment Governing Board (NAGB), as authorized by the NAEP legislation. 

The achievement levels describe selected points on the scale in terms of the types of skills that 
are or should be exhibited by students scoring at that level. The achievement level process was 
applied to the 1992 national NAEP reading composite and the 1994 national scales were linked 
to the 1992 national scales. Since the Trial State Assessment scales were linked to the national 
scales in both years, the interpretations of the selected levels also apply to the Trial State 
Assessment in 1994. 

NAGB has determined that achievement levels shall be the first and primary way of 
reporting NAEP results. Setting achievement levels is a method for setting standards on the 
NAEP assessment that identify what students should know and be able to do at various points 
on the reading composite. For each grade in the national assessment and, here, for grade 4 in 
the Trial State Assessment, four levels were defined — basic, proficient, advanced, and the region 
below basic. Based on initial policy definitions of these levels, panelists were asked to determine 
operational descriptions of the levels appropriate with the content and skills assessed in the 
reading assessment. With these descriptions in mind, the panelists were then asked to rate the 
assessment items in terms of the expected performance of marginally acceptable examinees at 
each of these levels. These ratings were then mapped onto the NAEP scale to obtain the 
achievement level cutpoints for reporting. Further details of the achievement level-setting 
process appear in Appendix F. 




8.5 ANALYSES 



When survey variables are observed without error from every respondent, standard 
variance estimators quantify the uncertainty associated with sample statistics from the only 
source of uncertainty, namely the sampling of respondents. Item-level statistics for NAEP 
cognitive items meet this requirement, but scale-score proficiency values do not. The IRT 
models used in their construction posit an unobservable proficiency variable θ to summarize 
performance on the items in the subarea. The fact that θ values are not observed even for the 
respondents in the sample requires additional statistical analyses to draw inferences about θ 
distributions and to quantify the uncertainty associated with those inferences. As described 
above, Rubin’s (1987) multiple imputations procedures were adapted to the context of latent 
variable models to produce the plausible values upon which many analyses of the data from the 
Trial State Assessment were based. This section describes how plausible values were employed 
in subsequent analyses to yield inferences about population and subpopulation distributions of 
proficiencies. 



8.5.1 Computational Procedures 

Even though one does not observe the θ value of respondent i, one does observe 
variables that are related to it: x_i, the respondent's answers to the cognitive items he or she was 
administered in the area of interest, and y_i, the respondent's answers to demographic and 
background variables. Suppose one wishes to draw inferences about a number T(θ,Y) that could 
be calculated explicitly if the θ and y values of each member of the population were known. 
Suppose further that if θ values were observable, we would be able to estimate T from a sample 
of N pairs of θ and y values by the statistic t(θ,y) [where (θ,y) ≡ (θ_1,y_1,...,θ_N,y_N)], and that we could 
estimate the variance in t around T due to sampling respondents by the function U(θ,y). Given 
that observations consist of (x_i,y_i) rather than (θ_i,y_i), we can approximate t by its expected value 
conditional on (x,y), or 

$$t^*(x,y) = E[\,t(\theta,y) \mid x,y\,] = \int t(\theta,y)\, p(\theta \mid x,y)\, d\theta .$$

It is possible to approximate t* with random draws from the conditional distributions 
p(θ_i | x_i, y_i), which are obtained for all respondents by the method described in section 8.3.3. Let 
θ̂_m be the mth such vector of plausible values, consisting of a multidimensional value for the 
latent variable of each respondent. This vector is a plausible representation of what the true θ 
vector might have been, had we been able to observe it. 

The following steps describe how an estimate of a scalar statistic t(θ,y) and its sampling 
variance can be obtained from M (> 1) such sets of plausible values. (Five sets of plausible 
values are used in NAEP analyses of the Trial State Assessment.) 

1) Using each set of plausible values θ̂_m in turn, evaluate t as if the plausible values 
were true values of θ. Denote the results t̂_m, for m = 1, ..., M. 

2) Using the jackknife variance estimator defined in Chapter 7, compute the 
estimated sampling variance of t̂_m, denoting the result U_m. 

3) The final estimate of t is 

$$t^* = \sum_{m=1}^{M} \frac{\hat{t}_m}{M} .$$

4) Compute the average sampling variance over the M sets of plausible values, to 
approximate uncertainty due to sampling respondents: 

$$U^* = \sum_{m=1}^{M} \frac{U_m}{M} .$$

5) Compute the variance among the M estimates t̂_m, to approximate uncertainty due 
to not observing θ values from respondents: 

$$B = \sum_{m=1}^{M} \frac{(\hat{t}_m - t^*)^2}{M - 1} .$$

6) The final estimate of the variance of t* is the sum of two components: 

$$V = U^* + (1 + M^{-1})\, B .$$

Note: Due to the excessive computation that would be required, NAEP analyses did not 
compute and average jackknife variances over all five sets of plausible values, but only on 
the first set. Thus, in NAEP reports, U* is approximated by U_1. 
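
Steps 3 through 6 can be sketched in a few lines. The function name, the `first_set_only` flag, and the numbers in the example are hypothetical. The estimates and jackknife variances are assumed to have been computed already in steps 1 and 2.

```python
import statistics

def combine_plausible_value_results(estimates, sampling_variances,
                                    first_set_only=False):
    """Combine M per-set estimates and their sampling variances.

    Returns (t_star, u_star, b, v) as defined in steps 3 to 6.
    """
    m = len(estimates)
    if m < 2:
        raise ValueError("at least two sets of plausible values are needed")
    t_star = statistics.fmean(estimates)                # step 3
    if first_set_only:
        u_star = sampling_variances[0]                  # U* approximated by U_1
    else:
        u_star = statistics.fmean(sampling_variances)   # step 4
    b = statistics.variance(estimates)                  # step 5, divides by M - 1
    v = u_star + (1.0 + 1.0 / m) * b                    # step 6
    return t_star, u_star, b, v

if __name__ == "__main__":
    # Hypothetical subgroup means and jackknife variances from five sets.
    means = [210.0, 212.0, 211.0, 209.0, 213.0]
    jackknife = [4.5, 4.2, 3.8, 3.6, 3.9]
    print(combine_plausible_value_results(means, jackknife))
    print(combine_plausible_value_results(means, jackknife, first_set_only=True))
```

Setting `first_set_only=True` mirrors the shortcut described in the note above, in which only the jackknife variance from the first set of plausible values is used.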



8.5.2 Statistical Tests 

Suppose that if θ values were observed for sampled students, the statistic (t − T)/U^{1/2} 
would follow a t-distribution with d degrees of freedom. Then the incomplete-data statistic 
(t* − T)/V^{1/2} is approximately t-distributed, with degrees of freedom given by 

$$\nu = \frac{1}{\dfrac{f^2}{M - 1} + \dfrac{(1 - f)^2}{d}} ,$$

where f is the proportion of total variance due to not observing θ values: 

$$f = \frac{(1 + M^{-1})\, B}{V} .$$

When B is small relative to U*, the reference distribution for incomplete-data statistics 
differs little from the reference distribution for the corresponding complete-data statistics. This 
is the case with main NAEP reporting variables. If, in addition, d is large, the normal 
approximation can be used to flag "significant" results. 
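
The degrees-of-freedom calculation can be sketched as follows. The function name and the example values are hypothetical. B and V are assumed to come from the combination steps in section 8.5.1, and d is the complete-data degrees of freedom.

```python
def plausible_value_degrees_of_freedom(b, v, m, d):
    """Return (f, nu): the proportion of total variance due to not observing
    theta, and the approximate degrees of freedom of (t* - T) / sqrt(V)."""
    f = (1.0 + 1.0 / m) * b / v
    nu = 1.0 / (f ** 2 / (m - 1) + (1.0 - f) ** 2 / d)
    return f, nu

if __name__ == "__main__":
    # Hypothetical values: B = 2.5, V = 7.0, M = 5 sets, d = 30.
    f, nu = plausible_value_degrees_of_freedom(2.5, 7.0, 5, 30)
    print(f"f = {f:.4f}, nu = {nu:.2f}")
```

When B is zero, f is zero and the function returns ν = d, which agrees with the statement that the incomplete-data reference distribution then matches the complete-data one.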









For k-dimensional t, such as the k coefficients in a multiple regression analysis, each U_m 
and U* is a covariance matrix, and B is an average of squares and cross-products rather than 
simply an average of squares. In this case, the quantity (T − t*) V⁻¹ (T − t*)' is approximately F 
distributed, with degrees of freedom equal to k and ν, with ν defined as above but with a matrix 
generalization of f: 



$$f = (1 + M^{-1})\, \mathrm{Trace}(B V^{-1}) / k .$$

By the same reasoning as used for the normal approximation for scalar t, a chi-square 
distribution on k degrees of freedom often suffices. 



8.5.3 Biases in Secondary Analyses 

Statistics t* that involve proficiencies in a scaled content area and variables included in 
the conditioning variables y^c are consistent estimates of the corresponding population values T. 
Statistics involving background variables y that were not conditioned on, or relationships among 
proficiencies from different content areas, are subject to asymptotic biases whose magnitudes 
depend on the type of statistic and the strength of the relationships of the nonconditioned 
background variables to the variables that were conditioned on and to the proficiency of interest. 
That is, the large sample expectations of certain sample statistics need not equal the true 
population parameters. 

The direction of the bias is typically to underestimate the effect of nonconditioned 
variables. For details and derivations see Beaton and Johnson (1990), Mislevy (1991), and 
Mislevy and Sheehan (1987, section 10.3.5). For a given statistic t* involving one content area 
and one or more nonconditioned background variables, the magnitude of the bias is related to 
the extent to which observed responses x account for the latent variable θ, and the degree to 
which the nonconditioned background variables are explained by conditioning background 
variables. The first factor, conceptually related to test reliability, acts consistently in that 
greater measurement precision reduces biases in all secondary analyses. The second factor acts 
to reduce biases in certain analyses but increase them in others. In particular, 

• High shared variance between conditioned and nonconditioned background 
variables mitigates biases in analyses that involve only proficiency and 
nonconditioned variables, such as marginal means or regressions. 

• High shared variance exacerbates biases in regression coefficients of conditional 
effects for nonconditioned variables, when nonconditioned and conditioned 
background variables are analyzed jointly as in multiple regression. 

The large number of background variables that have been included in the conditioning 
vector for the Trial State Assessment allows a large number of secondary analyses to be carried 
out with little or no bias, and mitigates biases in analyses of the marginal distributions of θ in 
nonconditioned variables. Kaplan and Nelson's analysis of the 1988 NAEP reading data (some 
results of which are summarized in Mislevy, 1991), which had a similar design and fewer 
conditioning variables, indicates that the potential bias for nonconditioned variables in multiple 
regression analyses is below 10 percent, and that the bias in simple regressions of such variables is 
below 5 percent. Additional research (summarized in Mislevy, 1990) indicates that most of the 
bias reduction obtainable from conditioning on a large number of variables can be captured by 
instead conditioning on the first several principal components of the matrix of all original 
conditioning variables. This procedure was adopted for the Trial State Assessment by replacing 
the conditioning effects by the first K principal components, where K was selected so that 90 
percent of the total variance of the full set of conditioning variables (after standardization) was 
captured. Mislevy (1990) shows that this puts an upper bound of 10 percent on the average bias 
for all analyses involving the original conditioning variables. 







Chapter 9 



DATA ANALYSIS AND SCALING FOR 
THE 1994 TRIAL STATE ASSESSMENT IN READING¹ 



Nancy L. Allen, John Mazzeo, Eddie H. S. Ip, 
Spencer Swinton, Steven P. Isham, and Lois H. Worthington 

Educational Testing Service 



9.1 OVERVIEW 

This chapter describes the analyses carried out in the development of the 1994 Trial 
State Assessment reading scales. The procedures used were similar to those employed in the 
analysis of the 1992 Trial State Assessment in reading (Allen, Mazzeo, Isham, Fong, & Bowker, 
1994), and the 1990 and 1992 Trial State Assessments in mathematics (Mazzeo, 1991 and 
Mazzeo, Chang, Kulick, Fong, & Grima, 1993) and are based on the philosophical and 
theoretical underpinnings described in the previous chapter. 

There were five major steps in the analysis of the Trial State Assessment reading data, 
each of which is described in a separate section: 

• conventional item and test analyses (section 9.3); 

• item response theory (IRT) scaling (section 9.4); 

• estimation of state and subgroup proficiency distributions based on the "plausible 
values" methodology (section 9.5); 

• linking of the 1994 Trial State Assessment scales to the corresponding scales from 
the 1994 national assessment (section 9.6); and 

• creation of the Trial State Assessment reading composite scale (section 9.7). 

To set the context within which to describe the methods and results of scaling 
procedures, a brief review of the assessment instruments and administration procedures is 
provided. 



¹Thanks to James Carlson, Huahua Chang, John Donoghue, David Freund, Frank Jenkins, Laura Jerry, Eugene Johnson, 
Ed Kulick, Jo-lin Liang, Eiji Muraki, Jennifer Nelson, and Neal Thomas for their help in completing the analysis. Thanks 
also to Angela Grima for her contributions to the original draft of this chapter. 







9.2 ASSESSMENT INSTRUMENTS AND SCORING 



9.2.1 Items, Booklets, and Administration 

The 1994 Trial State Assessment in reading was administered to fourth-grade public- and 
nonpublic-school students. The items in the instruments were based on the curriculum 
framework described in Chapter 2. 

The fourth-grade item pool contained 84 items. Each item was categorized into one of two 
content areas: 43 were Reading for Literary Experience items and 41 were Reading to Gain 
Information items. These items, 39 of which were multiple-choice items, 37 of which were short 
constructed-response items, and 8 of which were extended constructed-response items, were 
divided into 8 mutually exclusive blocks. The composition of each block of items, in terms of 
content and format, is given in Table 9-1. Note that each block contained items from only one 
of the two content domains. 

The 8 blocks were used to form 16 different booklets according to a partially balanced 
incomplete block (PBIB) design (see Chapter 2 for details). Each of these booklets contained 
two blocks of items, and each block of items appeared in exactly four booklets. To balance 
possible block position effect, each block appeared twice as the first block of reading items and 
twice as the second block. In addition, the design required that each block of items be paired in 
a booklet with every other block of items in the same content domain exactly once. Finally, 
each block of items was included in a booklet with a block of items from the other area. 
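
The balance properties listed above can be checked on a concrete layout. The sketch below builds one hypothetical set of 16 booklets that satisfies them and verifies each property. The block labels (L1 to L4 for Reading for Literary Experience, I1 to I4 for Reading to Gain Information), the split of four blocks per content area, and the particular pairings are assumptions made for illustration, not the actual NAEP booklet map.

```python
from collections import Counter
from itertools import combinations

LIT = ["L1", "L2", "L3", "L4"]   # hypothetical literary-experience blocks
INF = ["I1", "I2", "I3", "I4"]   # hypothetical gain-information blocks

def domain_booklets(b):
    """Six same-domain booklets as (first block, second block) pairs."""
    return [(b[0], b[1]), (b[1], b[2]), (b[2], b[3]),
            (b[3], b[0]), (b[0], b[2]), (b[1], b[3])]

# Twelve same-domain booklets plus four cross-domain booklets.
BOOKLETS = (domain_booklets(LIT) + domain_booklets(INF)
            + [("L3", "I1"), ("L4", "I2"), ("I3", "L1"), ("I4", "L2")])

def check_design(booklets):
    """Return a dict reporting whether each stated design property holds."""
    blocks = LIT + INF
    first = Counter(f for f, _ in booklets)
    second = Counter(s for _, s in booklets)
    pairs = Counter(frozenset(pair) for pair in booklets)
    cross = Counter()
    for f, s in booklets:
        if f[0] != s[0]:              # blocks come from different content areas
            cross[f] += 1
            cross[s] += 1
    return {
        "n_booklets": len(booklets),
        "twice_first": all(first[b] == 2 for b in blocks),
        "twice_second": all(second[b] == 2 for b in blocks),
        "same_domain_pairs_once": all(
            pairs[frozenset(p)] == 1
            for domain in (LIT, INF) for p in combinations(domain, 2)),
        "one_cross_domain_pairing": all(cross[b] == 1 for b in blocks),
    }

if __name__ == "__main__":
    print(check_design(BOOKLETS))
```

Each block appears in three same-domain booklets and one cross-domain booklet, which accounts for the four appearances per block stated in the text.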

Within each administration site, all booklets were "spiraled" together in a random 
sequence and distributed to students sequentially, in the order of the students’ names on the 
Student Listing Form (see Chapter 4). As a result of the partial BIB design and the spiraling of 
booklets, a considerable degree of balance was achieved in the data collection process. Each 
block of items (and, therefore, each item) was administered to randomly equivalent samples of 
students of approximately equal size (i.e., about 4/16 or 1/4 of the total sample size) within 
each jurisdiction and across all jurisdictions. In addition, within and across jurisdictions, 
randomly equivalent samples of approximately equal size received each particular block of items 
as the first or second block within a booklet. 
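
The effect of spiraling on sample sizes can be illustrated with a simplified model in which the 16 booklets are put in a random order once and then handed out cyclically down the student list. This is one reading of the procedure, not a description of the actual packaging process. The function name and session size are hypothetical.

```python
import random
from collections import Counter

def spiral_booklets(n_students, n_booklets=16, seed=0):
    """Assign booklet numbers 1..n_booklets to students in list order,
    cycling through a randomly ordered sequence of booklets."""
    rng = random.Random(seed)
    order = list(range(1, n_booklets + 1))
    rng.shuffle(order)                       # random sequence for the session
    return [order[i % n_booklets] for i in range(n_students)]

if __name__ == "__main__":
    counts = Counter(spiral_booklets(100))
    print(sorted(counts.items()))
```

Because the sequence cycles, booklet counts within a session differ by at most one. Since each block appears in 4 of the 16 booklets, each block reaches about one quarter of the students.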

As described in Chapter 4, a randomly selected half of the administration sessions within 
each jurisdiction that had never participated in a Trial State assessment before were observed by 
Westat-trained quality control monitors. A randomly selected fourth of the administration 
sessions within each jurisdiction that had participated in previous Trial State assessments were 
observed by quality control monitors. Thus, within and across jurisdictions, randomly equivalent 
samples of students received each block of items in a particular position within a booklet under 
monitored and unmonitored administration conditions. 



9.2.2 Scoring the Constructed-response Items 

As indicated earlier, the reading assessment included constructed-response items (details 
of the professional scoring process are given in Chapter 5). Responses to these items were 
included in the scaling process. 






